Posted to dev@drill.apache.org by Charles Givre <cg...@gmail.com> on 2019/01/29 01:43:38 UTC

Re: "Crude-but-effective" Arrow integration

Hey Paul, 
I’m curious as to what, if anything, ever came of this thread?  IMHO, you’re on to something here.  We could get the benefit of Arrow—specifically the interoperability with other big data tools—without the pain of having to completely re-work Drill. This seems like a real win-win to me.
— C

> On Aug 20, 2018, at 13:51, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi Ted,
> 
> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or to convert Drill output to Arrow.
> 
> The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
> 
> When changing Drill internals, code must change. There is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
> 
> Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But, run-time code is all about manipulating vectors and their metadata, often in quite detailed ways with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
> 
> Would be good to get second opinions. That PR I mentioned will show the volume of code that changed at that time (though Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
> 
> When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
> 
> Charles and others make a strong case for Arrow as an integration layer. What is the strong case for Arrow in Drill's internals? That's really the question the group will want to answer.
> 
> More details below.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <te...@gmail.com> wrote:  
> 
> Inline.
> 
> 
> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
> 
>> ...
>> By contrast, migrating Drill internals to Arrow has always been seen as
>> the bulk of the cost; costs which the "crude-but-effective" suggestion
>> seeks to avoid. Some of the full-integration costs include:
>> 
>> * Reworking Drill's direct memory model to work with Arrow's.
>> 
> 
> 
> Ted: This should be relatively isolated to the allocation/deallocation code. The
> deallocation should become a no-op. The allocation becomes simpler and
> safer.
> 
> Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same. (Probably did, since such integration is key to avoiding copies on send/receive.) That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
> 
> 
>> * Changing all low-level runtime code that works with vectors to instead
>> work with Arrow vectors.
>> 
> 
> 
> Ted: Why? You already said that most code doesn't have to change since the
> format is the same.
> 
> Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
> 
> Paul: But, in the Drill runtime engine, we don't work with the memory directly, we use the vector APIs, mutator APIs and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But, that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
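
To make the API gap concrete, here is a minimal, illustrative Java sketch (not from the thread) that writes a few values the Arrow way; the closing comment shows the Drill mutator style. Exact method signatures vary by version.

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class VectorApiGap {
      public static void main(String[] args) {
        // Arrow style: the vector itself exposes the write methods.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vec = new IntVector("a", allocator)) {
          vec.allocateNew(3);
          for (int i = 0; i < 3; i++) {
            vec.setSafe(i, i * 10);
          }
          vec.setValueCount(3);
        }
        // Drill style: writes go through a separate Mutator object,
        // roughly drillVec.getMutator().setSafe(i, i * 10). Each of
        // the thousands of such call sites would need rewriting to
        // the Arrow form above.
      }
    }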
> 
> 
>> * Change all Drill's vector metadata, and code that uses that metadata, to
>> use Arrow's metadata instead.
>> 
> 
> 
> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be
> simple. Why not just continue with that?
> 
> Paul: In an API, we can convert one data structure to the other by writing code to copy data. But, if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
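
To illustrate the peripheral alternative: metadata conversion on input or output is a small mapping layer. A sketch for a single type, using Arrow's Field, FieldType, and ArrowType classes; a real adapter would switch over Drill's full set of types rather than taking a bare name and nullability flag.

    import java.util.Collections;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;

    public class MetadataMapper {
      // Map one Drill column description to an Arrow Field. Only the
      // 32-bit signed int case is shown; other types are analogous.
      static Field toArrowField(String name, boolean nullable) {
        ArrowType type = new ArrowType.Int(32, true);
        FieldType fieldType = nullable
            ? FieldType.nullable(type)
            : FieldType.notNullable(type);
        return new Field(name, fieldType, Collections.emptyList());
      }
    }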
> 
> 
>> * Since generated code works directly with vectors, change all the code
>> generation.
>> 
> 
> Ted: Why? You said the UDFs would just work.
> 
> Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as today. If we do change Drill to Arrow, then, since UDFs are part of the code gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders. Drill complex writers must convert to Arrow complex writers.
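
For a sense of scale, this is roughly what a trivial UDF looks like in today's Drill, using Drill's holder and annotation classes. Under an Arrow-based engine the holder types would become Arrow's, so every UDF would need at least this much revision.

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;

    @FunctionTemplate(name = "add_one",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
    public class AddOneFunction implements DrillSimpleFunc {
      @Param IntHolder in;     // would become an Arrow holder
      @Output IntHolder out;   // would become an Arrow holder

      public void setup() { }

      public void eval() {
        out.value = in.value + 1;
      }
    }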
> 
> Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled memory flaw that they inherited from Drill. So, if we replace the mutators and writers, we might as well use the "result set loader" model, which a) hides the details, and b) manages memory to a given budget.  Either way, UDFs must change if we move to Arrow for Drill internals.
> 
> 
>> * Since Drill vectors and metadata are exposed via the Drill client to
>> JDBC and ODBC, those must be revised as well.
>> 
> 
> Ted: How much given the high level of compatibility?
> 
> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge, it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
> 
> 
> 
>> * Since the wire format will change, clients of Drill must upgrade their
>> JDBC/ODBC drivers when migrating to an Arrow-based Drill.
> 
> 
> Ted: Doesn't this have to happen fairly often anyway?
> 
> Ted: Perhaps this would be a good excuse for a 2.0 step.
> 
> Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
> 
> Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).
> 


Re: "Crude-but-effective" Arrow integration

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Jim,

Thanks for the description of the real-world use case. I like your idea of letting Drill do the grunt work, then letting the ML/AI workload focus on its part of the problem.

Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server that speaks Arrow to the client, and uses the native Drill client to speak to Drill. The native Drill client exposes Drill value vectors.

One trick would be to convert Drill vectors to the Arrow format. I think the data vectors use the same format; possibly the offset vectors do as well. I think Arrow went its own way with null-value (Drill's is-set) vectors. So, some conversions might be no-ops, while others might need to rewrite a vector. The good thing is that this is purely at the vector level, so it would be easy to write.
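
To illustrate the null-vector difference: a self-contained sketch, assuming Drill's one-byte-per-value is-set buffer and an Arrow-style one-bit-per-value validity bitmap (least-significant bit first).

    public class NullVectorConverter {

      // Convert a Drill-style "is-set" buffer (one byte per value,
      // 1 = set, 0 = null) into an Arrow-style validity bitmap (one
      // bit per value, least-significant bit first in each byte).
      static byte[] isSetToValidityBitmap(byte[] isSet, int valueCount) {
        byte[] bitmap = new byte[(valueCount + 7) / 8];
        for (int i = 0; i < valueCount; i++) {
          if (isSet[i] != 0) {
            bitmap[i / 8] |= (byte) (1 << (i % 8));
          }
        }
        return bitmap;
      }

      public static void main(String[] args) {
        byte[] isSet = {1, 0, 1, 1, 0, 0, 1, 0, 1};
        byte[] bitmap = isSetToValidityBitmap(isSet, isSet.length);
        // Expect 4D 01: bits 0, 2, 3, and 6 set in the first byte,
        // bit 0 set in the second.
        System.out.printf("%02X %02X%n", bitmap[0], bitmap[1]);
      }
    }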

The next issue is the one that Parth has long pointed out: Drill and Arrow each have their own memory allocators. How could we share a data vector between the two? The simplest initial solution is just to copy the data from Drill to Arrow. Slow, but transparent to the client.

A crude first approximation of the development steps (a skeleton of steps 3 and 4 follows the list):

1. Create the client shell server.
2. Implement the Arrow client protocol. Need some way to accept a query and return batches of results.
3. Forward the query to Drill using the native Drill client.
4. As a first pass, copy vectors from Drill to Arrow and return them to the client.
5. Then, solve that memory allocator problem to pass data without copying.
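
A skeleton of steps 3 and 4 might look like the sketch below. Every name in it is a hypothetical stand-in; the point is only the shape of the loop: pull batches from the native client, copy them into Arrow form, and hand them to whatever client protocol step 2 settles on.

    import java.util.List;

    public class BridgeSketch {
      // Hypothetical stand-ins for the native Drill client and the
      // Arrow-side batch; neither is a real API.
      interface DrillBatch { int rowCount(); }
      interface ArrowBatch { }
      interface DrillConnection {
        List<DrillBatch> runQuery(String sql);
      }

      static ArrowBatch copyToArrow(DrillBatch batch) {
        // Step 4, first pass: a straight vector-by-vector copy.
        // Step 5 would replace this with buffer ownership transfer
        // between the Drill and Arrow allocators to avoid the copy.
        return new ArrowBatch() { };
      }

      static void sendToClient(ArrowBatch batch) {
        // Step 2 lives here: the Arrow client protocol returns
        // batches to the caller.
      }

      static void serve(DrillConnection drill, String sql) {
        for (DrillBatch batch : drill.runQuery(sql)) {  // step 3
          sendToClient(copyToArrow(batch));             // step 4
        }
      }
    }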

Once the experimental work is done in the stand-alone server, the next step is to consider merging it into Drill itself for better performance. Still, it may be that, since all the bridge does is transform data, it works fine as a separate process.

FWIW, I did a prototype of something similar a couple of years ago. That project converted Drill's vectors to a row format, but the overall approach is similar; there might be one or two ideas in it to get someone started. [1] One thing I'd do differently today is to use something like gRPC instead of Netty for the RPC layer.

Something like this is well isolated and not hard if you take it step by step. That's why it seemed like a good Summer of Code project for an enterprising student interested in networking and data munging.

Thanks,
- Paul


[1] https://github.com/paul-rogers/drill-jig


 

    On Wednesday, January 30, 2019, 10:18:47 AM PST, Charles Givre <cg...@gmail.com> wrote:  
 
 Jim,
I really like this use case.  As a data scientist myself, I see the big value of Drill as being able to rapidly get raw data ready for machine learning.  It would be great if we could do this!

> On Jan 30, 2019, at 08:43, Jim Scott <js...@mapr.com> wrote:
> 
> Paul,
> 
> Your example is exactly the one I spoke with some people on the
> RAPIDS.ai project about: using Drill as a tool to gather (query) all
> the data to get a representative data set for an ML/AI workload, then
> feeding the result set directly into GPU memory. RAPIDS.ai is based on
> Arrow, from which it created a GPU Data Frame. The whole point of that
> project was to reduce the total number of memcopy operations, resulting
> in an end-to-end speed-up.
> 
> That model, letting Drill plug into other tools, would be a GREAT use
> case for Drill.
> 
> Jim
> 
> On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
> 
>> Hi Aman,
>> 
>> Thanks for sharing the update. Glad to hear things are still percolating.
>> 
>> I think Drill is an underappreciated treasure for doing queries in the
>> complex systems that folks seem to be building today. The ability to read
>> multiple data sources is something that maybe only Spark can do as well.
>> (And Spark can't act as a general-purpose query engine like Drill can.)
>> Adding Arrow support for input and output would build on this advantage.
>> 
>> I wonder if the output (client) side might be a great first start. Could
>> be built as a separate app just by combining Arrow and the Drill client
>> code together. Would let lots of Arrow-aware apps query data with Drill
>> rather than having to write their own readers, own filters, own aggregators
>> and, in the end, their own query engine.
>> 
>> Charles was asking about Summer of Code ideas. This might be one: a
>> stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that
>> and any Arrow tool in any language could talk to Drill via the bridge.
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <
>> amansinha@gmail.com> wrote:
>> 
>> Hi Charles,
>> You may have seen the talk that was given on the Drill Developer Day [1] by
>> Karthik and me ... look for the slides on 'Drill-Arrow Integration', which
>> describe two high-level options and what the integration might entail.
>> Option 1 corresponds to what you and Paul are discussing in this thread.
>> Option 2 is the deeper integration.  We do plan to work on one of them (not
>> finalized yet) but it will likely be after 1.16.0 since Statistics support
>> and Resource Manager related tasks (these were also discussed in the
>> Developer Day) are consuming our time.  If you are interested in
>> contributing/collaborating, let me know.
>> 
>> [1]
>> 
>> https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>> 
>> Aman
>> 

  

Re: "Crude-but-effective" Arrow integration

Posted by Charles Givre <cg...@gmail.com>.
Jim,
I really like this use case.  As a data scientist myself, I see the big value of Drill as being able to rapidly get raw data ready for machine learning.  It would be great if we could do this!

> On Jan 30, 2019, at 08:43, Jim Scott <js...@mapr.com> wrote:
> 
> [Jim's message and the earlier thread, quoted in full above, trimmed]


Re: "Crude-but-effective" Arrow integration

Posted by Jim Scott <js...@mapr.com>.
Paul,

Your example is exactly the one I spoke with some people on the
RAPIDS.ai project about: using Drill as a tool to gather (query) all
the data to get a representative data set for an ML/AI workload, then
feeding the result set directly into GPU memory. RAPIDS.ai is based on
Arrow, from which it created a GPU Data Frame. The whole point of that
project was to reduce the total number of memcopy operations, resulting
in an end-to-end speed-up.

That model, letting Drill plug into other tools, would be a GREAT use
case for Drill.

Jim

On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> [Paul's reply and the earlier thread, quoted in full above, trimmed]



-- 
Jim Scott | MapR
Mobile/Text: +1 (989) 450-0212

Re: "Crude-but-effective" Arrow integration

Posted by Charles Givre <cg...@gmail.com>.
Hi Aman, 
Thanks for sending that.  I looked through the slides and really liked the presentation.
@Paul, how would a Drill-to-Arrow bridge work exactly?  Would it require serialization/deserialization of Drill objects?  
—C

> On Jan 30, 2019, at 02:16, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> [Paul's reply and the earlier thread, quoted in full above, trimmed]


Re: "Crude-but-effective" Arrow integration

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Aman,

Thanks for sharing the update. Glad to hear things are still percolating.

> I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine like Drill can.) Adding Arrow support for input and output would build on this advantage.

> I wonder if the output (client) side might be a great first start. Could be built as a separate app just by combining Arrow and the Drill client code together. Would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, own filters, own aggregators and, in the end, their own query engine.

> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that and any Arrow tool in any language could talk to Drill via the bridge.

Thanks,
- Paul

 

    On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <am...@gmail.com> wrote:  
 
 Hi Charles,
You may have seen the talk that was given on the Drill Developer Day [1] by
Karthik and me ... look for the slides on 'Drill-Arrow Integration' which
describes 2 high level options and what the integration might entail.
Option 1 corresponds to what you and Paul are discussing in this thread.
Option 2 is the deeper integration.  We do plan to work on one of them (not
finalized yet) but it will likely be after 1.16.0 since Statistics support
and Resource Manager related tasks (these were also discussed in the
Developer Day) are consuming our time.  If you are interested in
contributing/collaborating, let me know.

[1]
https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn

Aman

On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Charles,
> I didn't see anything on this on the public mailing list. Haven't seen any
> commits related to it either. My guess is that this kind of interface is
> not important for the kind of data warehouse use cases that MapR is
> probably still trying to capture.
> I followed the Arrow mailing lists for much of last year. Not much
> activity in the Java arena. (I think most of that might be done by Dremio.)
> Most activity in other languages. The code itself has drifted far away from
> the original Drill structure. I found that even the metadata had vastly
> changed; turned out to be far too much work to port the "Row Set" stuff I
> did for Drill.
> This does mean, BTW, that the Drill folks did the right thing by not
> following Arrow. They'd have spend a huge amount of time tracking the
> massive changes.
> Still, converting Arrow vectors to Drill vectors might be an exercise in
> bit twirling and memory ownership. Harder now than it once was since I
> think Arrow defines all vectors to be nullable, and uses a different scheme
> than Drill for representing nulls.
> Thanks,
> - Paul
>
>
>
>    On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <
> cgivre@gmail.com> wrote:
>
>  Hey Paul,
> I’m curious as to what, if anything ever came of this thread?  IMHO,
> you’re on to something here.  We could get the benefit of
> Arrow—specifically the interoperability with other big data tools—without
> the pain of having to completely re-work Drill. This seems like a real
> win-win to me.
> — C
>
> > On Aug 20, 2018, at 13:51, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Ted,
> >
> > We may be confusing two very different ideas. The one is a
> Drill-to-Arrow adapter on Drill's periphery, this is the
> "crude-but-effective" integration suggestion. On the periphery we are not
> changing existing code, we're just building an adapter to read Arrow data
> into Drill, or convert Drill output to Arrow.
> >
> > The other idea, being discussed in a parallel thread, is to convert
> Drill's runtime engine to use Arrow. That is a whole other beast.
> >
> > When changing Drill internals, code must change. There is a cost
> associated with that. Whether the Arrow code is better or not is not the
> key question. Rather, the key question is simply the volume of changes.
> >
> > Drill divides into roughly two main layers: plan-time and run-time.
> Plan-time is not much affected by Arrow. But, run-time code is all about
> manipulating vectors and their metadata, often in quite detailed ways with
> APIs unique to Drill. While swapping Arrow vectors for Drill vectors is
> conceptually simple, those of us who've looked at the details have noted
> that the sheer volume of the lines of code that must change is daunting.
> >
> > Would be good to get second options. That PR I mentioned will show the
> volume of code that changed at that time (but Drill has grown since then.)
> Parth is another good resource as he reviewed the original PR and has kept
> a close eye on Arrow.
> >
> > When considering Arrow in the Drill execution engine, we must
> realistically understand the cost then ask, do the benefits we gain justify
> those costs? Would Arrow be the highest-priority investment? Frankly, would
> Arrow integration increase Drill adoption more than the many other topics
> discussed recently on these mail lists?
> >
> > Charles and others make a strong case for Arrow for integration. What is
> the strong case for Drill's internals? That's really the question the group
> will want to answer.
> >
> > More details below.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
> >
> > Inline.
> >
> >
> > On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> >> ...
> >> By contrast, migrating Drill internals to Arrow has always been seen as
> >> the bulk of the cost; costs which the "crude-but-effective" suggestion
> >> seeks to avoid. Some of the full-integration costs include:
> >>
> >> * Reworking Drill's direct memory model to work with Arrow's.
> >>
> >
> >
> > Ted: This should be relatively isolated to the allocation/deallocation
> code. The
> > deallocation should become a no-op. The allocation becomes simpler and
> > safer.
> >
> > Paul: If only that were true. Drill has an ingenious integration of
> vector allocation and Netty. Arrow may have done the same. (Probably did,
> since such integration is key to avoiding copies on send/receive.). That
> code is highly complex. Clearly, the swap can be done; it will simply take
> some work to get right.
> >
> >
> >> * Changing all low-level runtime code that works with vectors to instead
> >> work with Arrow vectors.
> >>
> >
> >
> > Ted: Why? You already said that most code doesn't have to change since
> the
> > format is the same.
> >
> > Paul: My comment about the format being the same was that the direct
> memory layout is the same, allowing conversion of a Drill vector to an
> Arrow vector by relabeling the direct memory that holds the data.
> >
> > Paul: But, in the Drill runtime engine, we don't work with the memory
> directly, we use the vector APIs, mutator APIs and so on. These all changed
> in Arrow. Granted, the Arrow versions are cleaner. But, that does mean that
> every vector reference (of which there are thousands) must be revised to
> use the Arrow APIs. That is the cost that has put us off a bit.
> >
> >
> >> * Change all Drill's vector metadata, and code that uses that metadata,
> to
> >> use Arrow's metadata instead.
> >>
> >
> >
> > Ted: Why? You said that converting Arrow metadata to Drill's metadata
> would be
> > simple. Why not just continue with that?
> >
> > Paul: In an API, we can convert one data structure to the other by
> writing code to copy data. But, if we change Drill's internals, we must
> rewrite code in every operator that uses Drill's metadata to instead use
> Arrows. That is a much more extensive undertaking than simply converting
> metadata on input or output.
> >
> >
> >> * Since generated code works directly with vectors, change all the code
> >> generation.
> >>
> >
> > Ted: Why? You said the UDFs would just work.
> >
> > Paul: Again, I fear we are confusing two issues. If we don't change
> Drill's internals, then UDFs will work as today. If we do change Drill to
> Arrow, then, since UDFs are part of the code gen system, they must change
> to adapt to the Arrow APIs. Specially, Drill "holders" must be converted to
> Arrow holders. Drill complex writers must convert to Arrow complex writers.
> >
> > Paul: Here I'll point out that the Arrow vector code and writers have
> the same uncontrolled memory flaw that they inherited from Drill. So, if we
> replace the mutators and writers, we might as well use the "result set
> loader" model which a) hides the details, and b) manages memory to a given
> budget.  Either way, UDFs must change if we move to Arrow for Drill
> internals.
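> >
> > Paul: By "manages memory to a given budget" I mean the writer, not the
> > operator, decides when a batch is full. A hypothetical miniature with one
> > fixed-width int column (not the real loader API, just the shape):
> >
> >   public class BudgetSketch {
> >     public static void main(String[] args) {
> >       final int budgetBytes = 64;            // tiny budget for the demo
> >       int[] batch = new int[budgetBytes / Integer.BYTES];
> >       int rowCount = 0;
> >       for (int value = 0; value < 100; value++) {
> >         if (rowCount == batch.length) {      // "isFull()": budget reached
> >           System.out.println("harvest batch of " + rowCount + " rows");
> >           rowCount = 0;                      // begin the next batch
> >         }
> >         batch[rowCount++] = value;           // "save row"
> >       }
> >       System.out.println("final batch of " + rowCount + " rows");
> >     }
> >   }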
> >
> >
> >> * Since Drill vectors and metadata are exposed via the Drill client to
> >> JDBC and ODBC, those must be revised as well.
> >>
> >
> > Ted: How much given the high level of compatibility?
> >
> > Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector
> and metadata classes must be revised to use Arrow vectors and metadata,
> adapting the code to the changed APIs. This is not a huge technical
> challenge, it is just a pile of work. Perhaps this was done in that Arrow
> conversion PR.
> >
> >
> >
> >> * Since the wire format will change, clients of Drill must upgrade their
> >> JDBC/ODBC drivers when migrating to an Arrow-based Drill.
> >
> >
> > Ted: Doesn't this have to happen fairly often anyway?
> >
> > Ted: Perhaps this would be a good excuse for a 2.0 step.
> >
> > Paul: As Drill matures, users would appreciate the ability to use JDBC
> and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops
> using the drivers against five Drill clusters, it is impractical to upgrade
> everything in one go.
> >
> > Paul: You hit the nail on the head: conversion to Arrow would justify a
> jump to "Drill 2.0" to explain the required big-bang upgrade (and, to
> highlight the cool new capabilities that come with Arrow.)
> >
>  

Re: "Crude-but-effective" Arrow integration

Posted by Aman Sinha <am...@gmail.com>.
Hi Charles,
You may have seen the talk Karthik and I gave at the Drill Developer Day
[1]; look for the slides on 'Drill-Arrow Integration', which describe two
high-level options and what the integration might entail. Option 1
corresponds to what you and Paul are discussing in this thread; Option 2
is the deeper integration. We do plan to work on one of them (not
finalized yet), but it will likely be after 1.16.0, since Statistics
support and Resource Manager related tasks (also discussed at the
Developer Day) are consuming our time. If you are interested in
contributing/collaborating, let me know.

[1]
https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn

Aman

On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Charles,
> I didn't see anything on this on the public mailing list. Haven't seen any
> commits related to it either. My guess is that this kind of interface is
> not important for the kind of data warehouse use cases that MapR is
> probably still trying to capture.
> I followed the Arrow mailing lists for much of last year. Not much
> activity in the Java arena. (I think most of that might be done by Dremio.)
> Most activity in other languages. The code itself has drifted far away from
> the original Drill structure. I found that even the metadata had vastly
> changed; turned out to be far too much work to port the "Row Set" stuff I
> did for Drill.
> This does mean, BTW, that the Drill folks did the right thing by not
> following Arrow. They'd have spent a huge amount of time tracking the
> massive changes.
> Still, converting Arrow vectors to Drill vectors might be an exercise in
> bit twirling and memory ownership. Harder now than it once was since I
> think Arrow defines all vectors to be nullable, and uses a different scheme
> than Drill for representing nulls.
> Thanks,
> - Paul

Re: "Crude-but-effective" Arrow integration

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Charles,
I didn't see anything on this on the public mailing list. Haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena. (I think most of that might be done by Dremio.) Most activity in other languages. The code itself has drifted far away from the original Drill structure. I found that even the metadata had vastly changed; turned out to be far too much work to port the "Row Set" stuff I did for Drill.
This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. Harder now than it once was since I think Arrow defines all vectors to be nullable, and uses a different scheme than Drill for representing nulls.
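For instance, assuming my recollection of the two layouts is right (Arrow: one validity bit per value, least-significant bit first; Drill: one byte per value in the "bits" vector, 1 = not null), the null-scheme half of the conversion is a small bit-twiddling loop:

    public class ValiditySketch {
      // Expand an Arrow-style validity bitmap into a Drill-style
      // byte-per-value "bits" buffer. Layout assumptions as stated above.
      static byte[] bitmapToBytes(byte[] validity, int valueCount) {
        byte[] bits = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
          bits[i] = (byte) ((validity[i >>> 3] >>> (i & 7)) & 1);
        }
        return bits;
      }
    }

The memory-ownership half, deciding whose allocator owns the underlying buffer, has no such shortcut.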
Thanks,
- Paul

 
