You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Bob Rudis <bo...@rud.is> on 2016/12/23 00:35:54 UTC

New Drill R "dplyr" interface

First time poster; short-time lurker.

I've built an alpha R package — sergeant —
https://github.com/hrbrmstr/sergeant — that provides direct access to
the Drill REST and JDBC interfaces and has a very nascent R "dplyr"
interface (if you know R you prbly know what that means).

Please kick the tyres and file github issues for features or bugs.

-Bob

Re: New Drill R "dplyr" interface

Posted by Nicolas Paris <ni...@gmail.com>.
2016-12-26 16:22 GMT+01:00 Bob Rudis <bo...@rud.is>:

> I guess I read more into "ANSI SQL" than I should have. While it does
> handle some complex SQL generated by dplyr (which you can see at
> https://github.com/hrbrmstr/sergeant if you scroll down) there are
> instances when if, say you try to filter and left join two different
> tables it generates valid SQL that Drill dislikes. You can see the
> dplyr that fails here: http://rpubs.com/hrbrmstr/drillex
>
> dplyr generates the following SQL:
>
>     SELECT *
>     FROM ( SELECT *
>            FROM ( SELECT *
>                   FROM  dfs.tmp.`/in/c.parquet`
>                   WHERE ( cyl  = 6.0 ) ) xpintqtlzw
>            LEFT JOIN dfs.tmp.`/in/b.json`
>            USING ( car ) ) fyxhnlrrca
>     LIMIT 10
>
> The error message (if one doesn't want to go to the rpubs URL) is:
>
>     SYSTEM ERROR: DrillRuntimeException: Join only supports implicit
> casts between
>     1. Numeric data
>     2. Varchar, Varbinary data
>     3. Date, Timestamp data Left type: INT, Right type: VARCHAR.
>     Add explicit casts to avoid this error
>
> "car" is the same in both data sets as it's just different column
> slices of the seminal mtcars data set but with a common `car` column.
> It fails if both files are the same file format.
>
> Also, when attempting to make a "data frame to parquet" function which
> sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
> stack exceptions and hangs (in the embedded console, localhost console
> and REST POST query). Said operation is the only thing that really
> makes sense for such a convenience function since one could just
> `jsonlite::stream_out(...)` a data frame and CTAS from the console
> otherwise (an R wrapper for that makes little sense IMO). The SparkR
> `write.parquet()` actually calls a Java function to do the conversion.
> Note that this isn't a complaint, Drill is at heart an EDA/BI
> workhorse and while it provides the data format transformation
> functions such a use of VALUES is a bit extreme.
>

​In the meantime, is the "VALUES syntax" a good way ? I guess the planner
needs to read all the SQL before writing data. Maybe data-frame -> CSV ->
CTAS FROM csv is a better way in any cases ?
​


>
> On Sun, Dec 25, 2016 at 2:40 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > On Sun, Dec 25, 2016 at 11:04 AM, Bob Rudis <bo...@rud.is> wrote:
> >
> >> Drill SQL seems to not be as "standard" as it claims (some
> >> operations that are valid SQL after a series of dplyr transforms  shld
> >> work but do not).
> >>
> >
> > Can you elaborate?  What operations?
> >
> > Also, Drill claims that it is as compatible as possible where it
> implements
> > some function. It doesn't claim to be complete.
>

Re: New Drill R "dplyr" interface

Posted by Bob Rudis <bo...@rud.is>.
Aye. I'll file an issue. I can definitely reproduce it in any Drill
context I have.

(I'm hyper-threading here as your CSV reply is later than this one).

If one has a CSV or JSON, then I'm not sure why one would use R for
the Drill machinations. Just as easy (if not easier) to fire off a
small script in `drill-localhost` or `drill-embedded` to do the work.
I'm still (it's prbly just my inability to grok this use case) not
sure why one wld turn an in-memory R data frame to a parquet file
unless one really works with multi-GB data frames directly in R (which
I have to then ask "why" when there's Drill, Spark, H2O, etc that do a
good amt of that work better, albeit lacking in many of the more
advanced stats/ML pkgs R has available to it).

On Mon, Dec 26, 2016 at 3:24 PM, Ted Dunning <te...@gmail.com> wrote:
> On Mon, Dec 26, 2016 at 7:22 AM, Bob Rudis <bo...@rud.is> wrote:
>
>> Also, when attempting to make a "data frame to parquet" function which
>> sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
>> stack exceptions and hangs (in the embedded console, localhost console
>> and REST POST query).
>>
>
> Can you be more explicit here (and possibly file a bug on JIRA
> <https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel>
> )?
>
> I just tried this with 1100 rows and it worked fine.
>
> If you can attach an actual CTAS query to the JIRA, that would be awesome.

Re: New Drill R "dplyr" interface

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 26, 2016 at 7:22 AM, Bob Rudis <bo...@rud.is> wrote:

> Also, when attempting to make a "data frame to parquet" function which
> sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
> stack exceptions and hangs (in the embedded console, localhost console
> and REST POST query).
>

Can you be more explicit here (and possibly file a bug on JIRA
<https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel>
)?

I just tried this with 1100 rows and it worked fine.

If you can attach an actual CTAS query to the JIRA, that would be awesome.

Re: New Drill R "dplyr" interface

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 26, 2016 at 7:22 AM, Bob Rudis <bo...@rud.is> wrote:

> I guess I read more into "ANSI SQL" than I should have.
>

Or not.  Could be a problem with Drill, too.


> While it does
> handle some complex SQL generated by dplyr (which you can see at
> https://github.com/hrbrmstr/sergeant
>

Btw... this is really pretty danged awesome.



> if you scroll down) there are
> instances when if, say you try to filter and left join two different
> tables it generates valid SQL that Drill dislikes. You can see the
> dplyr that fails here: http://rpubs.com/hrbrmstr/drillex
>
> dplyr generates the following SQL:
>
>     SELECT *
>     FROM ( SELECT *
>            FROM ( SELECT *
>                   FROM  dfs.tmp.`/in/c.parquet`
>                   WHERE ( cyl  = 6.0 ) ) xpintqtlzw
>            LEFT JOIN dfs.tmp.`/in/b.json`
>            USING ( car ) ) fyxhnlrrca
>     LIMIT 10
>
> The error message (if one doesn't want to go to the rpubs URL) is:
>
>     SYSTEM ERROR: DrillRuntimeException: Join only supports implicit
> casts between
>     1. Numeric data
>     2. Varchar, Varbinary data
>     3. Date, Timestamp data Left type: INT, Right type: VARCHAR.
>     Add explicit casts to avoid this error
>

I haven't been able to dig into this, but this looks pretty straightforward
to replicate.

Re: New Drill R "dplyr" interface

Posted by Bob Rudis <bo...@rud.is>.
I guess I read more into "ANSI SQL" than I should have. While it does
handle some complex SQL generated by dplyr (which you can see at
https://github.com/hrbrmstr/sergeant if you scroll down) there are
instances when if, say you try to filter and left join two different
tables it generates valid SQL that Drill dislikes. You can see the
dplyr that fails here: http://rpubs.com/hrbrmstr/drillex

dplyr generates the following SQL:

    SELECT *
    FROM ( SELECT *
           FROM ( SELECT *
                  FROM  dfs.tmp.`/in/c.parquet`
                  WHERE ( cyl  = 6.0 ) ) xpintqtlzw
           LEFT JOIN dfs.tmp.`/in/b.json`
           USING ( car ) ) fyxhnlrrca
    LIMIT 10

The error message (if one doesn't want to go to the rpubs URL) is:

    SYSTEM ERROR: DrillRuntimeException: Join only supports implicit
casts between
    1. Numeric data
    2. Varchar, Varbinary data
    3. Date, Timestamp data Left type: INT, Right type: VARCHAR.
    Add explicit casts to avoid this error

"car" is the same in both data sets as it's just different column
slices of the seminal mtcars data set but with a common `car` column.
It fails if both files are the same file format.

Also, when attempting to make a "data frame to parquet" function which
sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
stack exceptions and hangs (in the embedded console, localhost console
and REST POST query). Said operation is the only thing that really
makes sense for such a convenience function since one could just
`jsonlite::stream_out(...)` a data frame and CTAS from the console
otherwise (an R wrapper for that makes little sense IMO). The SparkR
`write.parquet()` actually calls a Java function to do the conversion.
Note that this isn't a complaint, Drill is at heart an EDA/BI
workhorse and while it provides the data format transformation
functions such a use of VALUES is a bit extreme.

On Sun, Dec 25, 2016 at 2:40 PM, Ted Dunning <te...@gmail.com> wrote:
> On Sun, Dec 25, 2016 at 11:04 AM, Bob Rudis <bo...@rud.is> wrote:
>
>> Drill SQL seems to not be as "standard" as it claims (some
>> operations that are valid SQL after a series of dplyr transforms  shld
>> work but do not).
>>
>
> Can you elaborate?  What operations?
>
> Also, Drill claims that it is as compatible as possible where it implements
> some function. It doesn't claim to be complete.

Re: New Drill R "dplyr" interface

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Dec 25, 2016 at 11:04 AM, Bob Rudis <bo...@rud.is> wrote:

> Drill SQL seems to not be as "standard" as it claims (some
> operations that are valid SQL after a series of dplyr transforms  shld
> work but do not).
>

Can you elaborate?  What operations?

Also, Drill claims that it is as compatible as possible where it implements
some function. It doesn't claim to be complete.

Re: New Drill R "dplyr" interface

Posted by Bob Rudis <bo...@rud.is>.
I'll be working on adding said functionality (data.frame to parquet)
in the near future (though the use-case is still a bit lost on me it's
not the first request for it I've seen). I just revamped the dplyr
interface (it's almost complete now apart from some details around
quoting). Drill SQL seems to not be as "standard" as it claims (some
operations that are valid SQL after a series of dplyr transforms  shld
work but do not).


On Fri, Dec 23, 2016 at 5:40 PM, Nicolas Paris <ni...@gmail.com> wrote:
> 2016-12-23 6:45 GMT+01:00 Ted Dunning <te...@gmail.com>:
>
>> Very cool!
>>
>>
>>
>> On Thu, Dec 22, 2016 at 4:35 PM, Bob Rudis <bo...@rud.is> wrote:
>>
>> > First time poster; short-time lurker.
>> >
>> > I've built an alpha R package — sergeant —
>> > https://github.com/hrbrmstr/sergeant — that provides direct access to
>> > the Drill REST and JDBC interfaces and has a very nascent R "dplyr"
>> > interface (if you know R you prbly know what that means).
>> >
>>
>
>
> Hi,
>
> Does your package gives the ability to write an R data.frame into Parquet
> thanks to Drill CTAS too ?
>
>
>
>> > Please kick the tyres and file github issues for features or bugs.
>> >
>> > -Bob
>> >
>>

Re: New Drill R "dplyr" interface

Posted by Nicolas Paris <ni...@gmail.com>.
2016-12-23 6:45 GMT+01:00 Ted Dunning <te...@gmail.com>:

> Very cool!
>
>
>
> On Thu, Dec 22, 2016 at 4:35 PM, Bob Rudis <bo...@rud.is> wrote:
>
> > First time poster; short-time lurker.
> >
> > I've built an alpha R package — sergeant —
> > https://github.com/hrbrmstr/sergeant — that provides direct access to
> > the Drill REST and JDBC interfaces and has a very nascent R "dplyr"
> > interface (if you know R you prbly know what that means).
> >
>

​​
Hi,

Does your package gives the ability to write an R data.frame into Parquet
thanks to Drill CTAS too ?​​



> > Please kick the tyres and file github issues for features or bugs.
> >
> > -Bob
> >
>

Re: New Drill R "dplyr" interface

Posted by Ted Dunning <te...@gmail.com>.
Very cool!



On Thu, Dec 22, 2016 at 4:35 PM, Bob Rudis <bo...@rud.is> wrote:

> First time poster; short-time lurker.
>
> I've built an alpha R package — sergeant —
> https://github.com/hrbrmstr/sergeant — that provides direct access to
> the Drill REST and JDBC interfaces and has a very nascent R "dplyr"
> interface (if you know R you prbly know what that means).
>
> Please kick the tyres and file github issues for features or bugs.
>
> -Bob
>