Posted to dev@drill.apache.org by Paul Rogers <pr...@mapr.com> on 2017/06/05 18:59:24 UTC

Thinking about Drill 2.0

Hi All,

A while back there was a discussion about the scope of Drill 2.0. Got me thinking about possible topics. My two cents:

Drill 2.0 should focus on making Drill’s external APIs production ready. This means five things:

* Clearly identify and define each API.
* (Re)design each API to ensure it fully isolates the client from Drill internals.
* Ensure the API allows full version compatibility: Allow mixing of old/new clients and servers with some limits.
* Fully test each API.
* Fully document each API.

Once client code is isolated from Drill internals, we are free to evolve the internals in either Drill 2.0 or a later release.

In my mind, the top APIs to revisit are:

* The Drill client API.
* The storage plugin API.

(Explanation below.)

What other APIs should we consider? Here are some examples, please suggest items you know about:

* Command line scripts and arguments
* REST API
* Names and contents of system tables
* Structure of the storage plugin configuration JSON
* Structure of the query profile
* Structure of the EXPLAIN PLAN output.
* Semantics of Drill functions, such as the date functions recently partially fixed by adding “ANSI” alternatives.
* Naming of config and system/session options.
* (Your suggestions here…)

I’ve taken the liberty of moving some API-breaking tickets in the Apache Drill JIRA to 2.0. Perhaps we can add others so that we have a good inventory of 2.0 candidates.

Here are the reasons for my two suggestions.

Today, we expose Drill value vectors to the client. This means that if we change anything about Drill’s internal memory format (that is, the value vectors; a possible move to Arrow, for example), we break compatibility with old clients. Using value vectors also means we need a very large percentage of Drill’s internal code on the client, in Java or C++. We are learning that maintaining that is a challenge.

A new client API should follow established SQL database tradition: a synchronous, row-based API designed for versioning, for forward and backward compatibility, and to support ODBC and JDBC users.
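
To make that concrete, here is a rough sketch of the kind of interface I have in mind. Every name below is invented for illustration; nothing like this exists in Drill today:

import java.util.Properties;

/** Hypothetical row-based Drill client (illustrative sketch only). */
public interface DrillRowClient extends AutoCloseable {

  /** Connect synchronously to a drillbit or ZooKeeper quorum. */
  static DrillRowClient connect(String url, Properties props) {
    throw new UnsupportedOperationException("sketch only");
  }

  /** Run a query and pull rows; no async callbacks, no value vectors exposed. */
  RowCursor query(String sql) throws Exception;

  interface RowCursor extends AutoCloseable {
    boolean next() throws Exception;               // advance to the next row
    int getInt(int column) throws Exception;       // typed accessors, JDBC-style
    String getString(int column) throws Exception;
  }
}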

We can certainly maintain the existing full, async, heavy-weight client for our tests and for applications that would benefit from it.

Once we define a new API, we are free to alter Drill’s value vectors to, say, add the needed null states to fully support JSON, to change offset vectors to not need n+1 values (which doubles vector size in 64K batches), and so on. Since vectors become private to Drill (or Arrow) after the new client API, we are free to innovate to improve them.
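
To spell out the offset-vector point with rough numbers (my reading of the current layout): a batch of 65,536 VarChar values carries 65,537 four-byte offsets, or 262,148 bytes. Because buffer allocations round up to a power of two, that single extra entry pushes the offset buffer from 256 KB to 512 KB, which is where the doubling comes from.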

Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated API will allow storage plugins to be built faster, but will also isolate them from Drill internals changes. Without isolation, each change to Drill internals would require plugin authors to update their plugin before Drill can be released.

Thoughts? Suggestions?

Thanks,

- Paul

Re: Thinking about Drill 2.0

Posted by Parth Chandra <pa...@apache.org>.
Some good work has been done in Arrow to get to a more formalized
representation of complex types (lists and maps), particularly in trying
to address the nullability issues. My recommendation would be to get to a
reasonable level of integration with Arrow and then start submitting
changes/patches to Arrow as we need them in Drill. Arrow is moving faster,
being at an earlier stage, so this approach is unlikely to hold us up.
It is also critical that we establish performance baselines before
switching to Arrow. We're hoping for improvement but must guard against
possible regressions.



Re: Thinking about Drill 2.0

Posted by Julien Le Dem <ju...@ledem.net>.
Hi Paul,
My 2ct regarding Arrow:
The goal of Arrow is to be a standard representation that does not break compatibility in the future.
If moving to Arrow is a breaking change, I don’t think it makes sense to abstract it out to present a row-oriented representation to the client. That defeats the purpose.
You can still use Arrow as your standard representation to the client and allow for custom vectors on the server side that get converted before sending. This sounds like it could be part of the smaller API you are talking about.
As for backward compatibility with the Drill ValueVectors, it is possible to make a compatibility layer that patches the few differences (bytes instead of bits for nullability, some type width differences) with little code.
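
To illustrate how small that compatibility layer could be, here is a sketch of the bit-to-byte nullability conversion (assuming an Arrow-style packed validity bitmap on one side and a Drill-style byte-per-value "bits" buffer on the other; buffer handling is simplified to plain arrays):

/** Illustrative only: expand a packed validity bitmap (1 bit per value, Arrow-style)
    into a byte-per-value nullability buffer (Drill-style). */
public final class ValidityShim {
  public static byte[] bitsToBytes(byte[] validityBitmap, int valueCount) {
    byte[] out = new byte[valueCount];
    for (int i = 0; i < valueCount; i++) {
      // Bit i of the bitmap: 1 = value is set, 0 = value is null.
      out[i] = (byte) ((validityBitmap[i >> 3] >> (i & 7)) & 1);
    }
    return out;
  }
}
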
For changing the offset vectors, it would be great to have this discussion on the Arrow mailing list so that we don’t diverge. (One simple workaround seems to be to use 64K-1 batches?)
Some work has been done on the Arrow side regarding JSON support (for example, maps are now nullable: ARROW-274).
Cheers
Julien


Re: Thinking about Drill 2.0

Posted by Julian Hyde <jh...@apache.org>.
Avatica?


Re: Thinking about Drill 2.0

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Paul,

see inline comments. 

Keep in mind that I'm new to Drill and still at the stage of only using it experimentally. Still, we have quite some relational DB usage in our workflows, so the same problems will apply to Drill as the interfaces are all quite common. While I have different problems I use DBs/query engines for, the main workflow I need columnar data structures for is during machine learning model development / usage: join different tables & data sets; apply transformations using SQL; ingest this data; prepare data further in Python; feed to machine learning pipeline.

> Am 15.06.2017 um 19:39 schrieb Paul Rogers <pr...@mapr.com>:
> 
> Hi Uwe,
> 
> This is incredibly helpful information! Your explanation makes perfect sense.
> 
> We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and row-based.

I have not yet dealt with the ODBC API itself directly, only with abstractions on top of it, but at least from what I have heard it should support a mode where the results are returned in a columnar fashion. We have therefore developed https://github.com/blue-yonder/turbodbc to have a columnar ODBC interface in Python / Pandas. Before that, we often had the problem that using the CSV export from the DB and then parsing the CSV was much faster than the Python ODBC libraries. There are some quite efficient CSV parsers around that are magnitudes faster than pure-Python row-to-column conversion code.

> There are three key challenges in working with Drill:
> 
> * Drill results are columnar, requiring a column-to-row translation for xDBC
> * Drill uses an asynchronous API, while JDBC and ODBC are synchronous, resulting in an async-to-sync API translation.
> * The JDBC API is based on the Drill client which requires quite a bit (almost all, really) of Drill code.
> 
> The thought is to create a new API that serves the need of ODBC and JDBC, but without the complexity (while, of course, preserving the existing client for other uses.) Said another way, find a way to keep the xDBC interfaces simple so that they don’t take quite so much space in the client, and don’t require quite so much work to maintain.
> 
> The first issue (row vs. columnar) turns out to not be a huge issue; the columnar-to-row translation code exists and works. The real issue is allowing the client to control the size of the data sent from the server. (At present, the server decides the “batch” size, and sometimes the size is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations), but retain the columnar form, even for ODBC and JDBC.
> 
> So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous queries over the same connection? Or, since Python seems to be only approximately multi-threaded, would a synchronous, columnar API work better? Here I just mean, in a single connection, is there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection model easier for Python to consume?

This greatly depends on the workflow. I could imagine that for applications where you start a large query that only produces a few rows as a result, it would be worthwhile to start multiple of these at once and let the client wait asynchronously for all of them. For me, as a data scientist, most queries return a rather large result set. I would typically join data sources together and do some transformations on the data with SQL, but in the end I would have a single table as a result, with which a longer-running machine learning pipeline is then fed. For this, multiple connections or multiple queries are not relevant; due to the size of the result, the time for the serializations between ODBC, Pandas/Arrow and Python structures is the more relevant component. In contrast to the query, which may scale over a large number of workers, the serialization is done on a single worker at the end.

> Another point you raise is that our client-side column format should be Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same data format as Arrow.) That way users of your work can easily leverage Drill.

Recently we have added an Apache Arrow interface to Turbodbc: http://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/ While that still gets the data via the traditional ODBC interface and does need some transformations, the data that is passed from C++ to Python is structured in a way that doesn't need to be transformed again in Python (which is the typical problem with traditional Python DB interfaces that return row-wise Python objects as results). Having a client that would already return Arrow structures instead of the ODBC ones would save at least one memory copy.

> This last question raises an interesting issue that I (at least) need to understand more clearly. Is Arrow a data format + code? Or, is the data format one aspect of Arrow, and the implementation another? Would be great to have a common data format, but as we squeeze ever more performance from Drill, we find we have to very carefully tune our data manipulation code for the specific needs of Drill queries. I wonder how we’d do that if we switched to using Arrow’s generic vector implementation code? Has anyone else wrestled with this question for your project?

Arrow is at its core a data format / specification for an in-memory representation of columnar data. In addition, it also brings Java / C++ / Python / C-GLib / JavaScript implementations of these data structures and helper functions to construct them. As I'm only involved in the C++ / Python side of things, I cannot really tell much detail about the Java implementation, but as this one was forked out of Drill, it should still be very close to the ValueVectors. But as a C++ consumer, for me it only matters that the format is respected, not which specific (code) implementation is used on the producer side.

Main takeaways: 
 * My queries produce rather large results (1-10 million rows)
 * These results are consumed by a single worker. If there were more consuming workers, the query results would scale with the number of workers, i.e. each worker would still receive 1-10 million rows.
 * Serialization / Transformation cost in the client code is my main concern. To effectively use Python on large data, I need columnar data; having a row-wise API may be the simpler interface but is very costly in the end.
 * If we extended this to UDFs in Python, they would have the same serialization/transformation problem as the client code, just on a smaller number of rows.

Uwe


Re: Thinking about Drill 2.0

Posted by Paul Rogers <pr...@mapr.com>.
Hi Uwe,

This is incredibly helpful information! Your explanation makes perfect sense.

We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and row-based. There are three key challenges in working with Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, resulting in an async-to-sync API translation.
* The JDBC API is based on the Drill client which requires quite a bit (almost all, really) of Drill code.

The thought is to create a new API that serves the need of ODBC and JDBC, but without the complexity (while, of course, preserving the existing client for other uses.) Said another way, find a way to keep the xDBC interfaces simple so that they don’t take quite so much space in the client, and don’t require quite so much work to maintain.

The first issue (row vs. columnar) turns out to not be a huge issue; the columnar-to-row translation code exists and works. The real issue is allowing the client to control the size of the data sent from the server. (At present, the server decides the “batch” size, and sometimes the size is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations), but retain the columnar form, even for ODBC and JDBC.
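
For example (purely hypothetical property names; nothing like this exists today), the client could cap batch sizes at connect time and otherwise leave the data columnar:

import java.util.Properties;

public class BatchSizeHint {
  public static void main(String[] args) {
    // Hypothetical connection properties: cap the size of each batch the server
    // may send, so the client can pre-allocate buffers of a known size while
    // the data itself stays in columnar form.
    Properties props = new Properties();
    props.setProperty("drill.client.batch.max_rows", "4096");
    props.setProperty("drill.client.batch.max_bytes", "1048576");
    System.out.println(props);
  }
}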

So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous queries over the same connection? Or, since Python seems to be only approximately multi-threaded, would a synchronous, columnar API work better? Here I just mean, in a single connection, is there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection model easier for Python to consume?

Another point you raise is that our client-side column format should be Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same data format as Arrow.) That way users of your work can easily leverage Drill.

This last question raises an interesting issue that I (at least) need to understand more clearly. Is Arrow a data format + code? Or, is the data format one aspect of Arrow, and the implementation another? Would be great to have a common data format, but as we squeeze ever more performance from Drill, we find we have to very carefully tune our data manipulation code for the specific needs of Drill queries. I wonder how we’d do that if we switched to using Arrow’s generic vector implementation code? Has anyone else wrestled with this question for your project?

Thanks,

- Paul


Re: Thinking about Drill 2.0

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Paul,

Bringing in a bit of the perspective partly of an Arrow developer, but mostly of someone who works quite a lot in Python with the respective data libraries there: in Python, all (performant) data crunching work is done on columnar representations. While this is partly because columnar is more CPU efficient for these tasks, it is also because columnar can be abstracted in a form where you implement all computational work in C/C++ or an LLVM-based JIT while still keeping clear and understandable interfaces in Python. In the end, to support Python efficiently, we will always have to convert into a columnar representation, which makes row-wise APIs to a system that is internally columnar quite annoying, as we get a lot of wastage in the conversion layer. If one wanted to support Python UDFs, this would lead to a situation where in most cases the UDF calls would be greatly dominated by the conversion logic.

For the actual performance difference this makes, you can have a look at the work recently happening in Apache Spark, where Arrow is used for the conversion of results from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames"). In comparison to the existing conversion, this currently sees a speedup of 40x, which will be even higher once further steps are implemented. Julien should be able to provide a link to slides that outline the work better.

As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware that for languages like Python, having a columnar API really matters. While Drill does not really integrate with Python as a first-class citizen at the moment, moving to row-wise APIs probably won't make a difference to the current situation, but good columnar APIs would help us keep the path open for the future.

Uwe


Re: Thinking about Drill 2.0

Posted by Paul Rogers <pr...@mapr.com>.
Thanks for the suggestions!

The issue is only partly Calcite changes. The real challenge for potential contributors is that the Drill storage plugin exposes Calcite mechanisms directly. That is, to write a storage plugin, one must know (or, more likely, experiment to learn) the odd set of calls made to the storage plugin: for a group scan, then a sub scan, then this or that. Then, having learned those calls, map what you want to do onto them. In some cases, as Calcite chugs along, it calls the same methods multiple times, so the plugin writer has to be prepared to implement caching to avoid banging on the underlying system multiple times for the same data.

The key opportunity here is to observe that the current API is at the implementation level: as callbacks from Calcite. (Though, the Drill “easy” storage plugin does hide some of the details.) Instead, we’d like an API at the definition level: that the plugin simply declares that, say, it can return a schema, or can handle certain kinds of filter push-down, etc.
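
A rough sketch of what such a definition-level contract could look like; all of these names are invented for illustration and are not the current storage plugin API:

import java.util.List;

/** Hypothetical declaration-level storage plugin contract (sketch only). */
public interface DeclarativeStoragePlugin {

  /** Declare the schema of a table, if the source can supply one up front. */
  TableSchema schemaOf(String tableName);

  /** Declare which push-downs the source can handle; a Drill-side adapter,
      not the plugin, would translate these declarations into Calcite rules. */
  boolean supports(PushDown capability);

  /** Create the per-fragment readers once planning has settled on a scan. */
  List<BatchReader> readersFor(ScanSpec spec);

  enum PushDown { PROJECTION, FILTER_EQUALITY, FILTER_RANGE, LIMIT }

  interface TableSchema {}
  interface ScanSpec {}
  interface BatchReader {}
}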

If we can define that API at the metadata (planning) level, then we can create an adapter between that API and Calcite. Doing so makes it much easier to test the plugin, and isolates the plugin from future code changes as Calcite evolves and improves: the adapter changes but not the plugin metadata API.

As you suggest, the resulting definition API would be handy to share between projects.

On the execution side, however, Drill plugins are very specific to Drill’s operator framework, Drill’s schema-on-read mechanism, Drill’s special columns (file metadata, partitions), Drill’s vector “mutators” and so on. Here, any synergy would be with Arrow to define a common “mutator” API so that a “row batch reader” written for one system should work with the other.
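
As a sketch of what a shared “mutator”-style contract might look like (hypothetical names; the real Drill mutators, and any future Arrow equivalent, would differ):

/** Hypothetical reader contract that writes through an engine-neutral mutator. */
public interface RowBatchReader {

  /** Open the underlying source (file split, table region, and so on). */
  void open(String source) throws Exception;

  /** Fill the next batch through the mutator; return false when exhausted. */
  boolean next(BatchMutator mutator) throws Exception;

  void close() throws Exception;

  /** The engine (Drill- or Arrow-based) supplies the implementation. */
  interface BatchMutator {
    void startRow();
    void setLong(int column, long value);
    void setNull(int column);
    void endRow();
  }
}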

In any case, this kind of sharing is hard to define up front; we might instead keep the discussion going to see what works for Drill, what we can abstract out, and how we can make the common abstraction work for other systems beyond Drill.

Thanks,

- Paul


Re: Thinking about Drill 2.0

Posted by Julian Hyde <jh...@apache.org>.
> On Jun 5, 2017, at 11:59 AM, Paul Rogers <pr...@mapr.com> wrote:
> 
> Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated API will allow storage plugins to be built faster, but will also isolate them from Drill internals changes. Without isolation, each change to Drill internals would require plugin authors to update their plugin before Drill can be released.

Sorry you’re getting burned by Calcite changes. We try to minimize impact, but sometimes it’s difficult to see what you’re breaking.

I like the goal of a stable storage plugin API. Maybe it’s something Drill and Calcite can collaborate on? Much of the DNA of an adapter is independent of the engine that will consume the data (Drill or otherwise) - it concerns how to create a connection, get metadata, push down logical operations, and generate queries in the target system’s query language. Calcite and Drill ought to be able to share that part, rather than maintaining separate collections of adapters.

Julian


Re: Thinking about Drill 2.0

Posted by Jinfeng Ni <jn...@apache.org>.
Agreed with the two items Parth listed: schema-free support and improving
Drill's execution architecture.

Schema-free (or schema-on-read) is one main feature that differentiates
Drill from other similar projects. There still seems to be a long list of
improvements needed to fully support this feature. I feel Arrow integration
probably would help. In that sense, we need to spend effort to investigate
and integrate Drill with the Arrow library. (The UnionVector on the Drill
side seems not to be fully completed, compared to what Arrow offers.)

The second item is also critical, as Drill uses more CPU/threads/memory
than necessary in many cases.

Regarding APIs or interfaces, we probably need to put them into two
categories: one for applications (API), the other for the server side
(SPI). I would assume the storage plugin and UDF interfaces would fall
into the second category. When we discuss compatibility, we may have
different requirements for the different categories.

Getting off the Calcite/Parquet forks is important, but I feel it may not
have to be a prerequisite for 2.0.



Re: Thinking about Drill 2.0

Posted by Parth Chandra <pa...@apache.org>.
Adding to my list of things to consider for Drill 2.0,  I would think that
getting Drill off our forks of Calcite and Parquet should also be a goal,
though a tactical one.



Re: Thinking about Drill 2.0

Posted by Parth Chandra <pa...@apache.org>.
Nice suggestion Paul, to start a discussion on 2.0 (it's about time). I
would like to make this a broader discussion than just APIs, though APIs
are a good place to start. In particular, we usually get the opportunity to
break backward compatibility only for a major release and that is the time
we have to finalize the APIs.

In the broader discussion I feel we also need to consider some other
aspects -
  1) Formalize Drill's support for schema free operations.
  2) Drill's execution engine architecture and its 'optimistic' use of
resources.

Re the APIs:
  One more public API is the UDFs. This and the storage plugin APIs
together are tied at the hip with vectors and memory management. I'm not
sure if we can cleanly separate the underlying representation of vectors
from the interfaces to these APIs, but I agree we need to clarify this
part. For instance, some of the performance benefits in the Parquet scan
come from vectorizing writes to the vector, especially for null or repeated
values. We could provide interfaces that offer the same benefit, without which
the scans would have to be vector-internals aware. The same goes for UDFs.
Assuming that a 2.0 goal would be to provide vectorized interfaces for
users to write table (or aggregate) UDFs, one now needs a standardized data
set representation. If you choose this data set representation to be
columnar (for better vectorization), will you end up with ValueVector/Arrow
based RecordBatches? I included Arrow in this since the project is
formalizing exactly this requirement.
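
To make the vectorized-interface idea a bit more concrete, here is one
possible shape, sketched under the assumption of a column-reader /
column-writer abstraction that hides the vector internals; none of these
interfaces exist in Drill today:

/** Hypothetical batch-at-a-time UDF contract (illustrative sketch only). */
public interface VectorizedUdf {

  /** Called once per incoming batch rather than once per row. */
  void processBatch(ColumnReader[] inputs, ColumnWriter output, int rowCount);

  /** Read side: positional access, with nulls handled explicitly. */
  interface ColumnReader {
    boolean isNull(int row);
    long getLong(int row);
  }

  /** Write side: the implementation behind it could be a ValueVector or an
      Arrow vector without the UDF author ever seeing the difference. */
  interface ColumnWriter {
    void setNull(int row);
    void setLong(int row, long value);
  }
}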

For the client APIs, I believe that ODBC and JDBC drivers initially were
written using record-based APIs provided by vendors, but to get better
performance started to move to working with raw streams coming over the
wire (e.g. TDS with Sybase/MS-SQLServer [1]). So what Drill does is in fact
similar to that approach. The client APIs are really thin layers on top of
the vector data stream and provide row-based, read-only access to the
vectors.

Lest I begin to sound too contrary,  thank you for starting this
discussion. It is really needed!

Parth