Posted to dev@drill.apache.org by Jacques Nadeau <ja...@dremio.com> on 2015/10/26 22:19:44 UTC

[DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Drillers,



A number of people have approached me recently about the possibility of
collaborating on a shared columnar in-memory representation of data. This
shared representation of data could be operated on efficiently with modern
CPUs as well as shared efficiently via shared memory, IPC and RPC. This
would allow multiple applications to work together at high speed. Examples
include moving back and forth between a library.



As I was discussing these ideas with people working on projects including
Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
like MapR and Trifacta, it became clear that much of what the Drill
community has already constructed is very relevant to the goals of a new
broader interchange and execution format. (In fact, Ted and I actually
informally discussed extracting this functionality as a library more than
two years ago.)



A standard will emerge around this need and it is in the best interest of
the Drill community and the broader ecosystem if Drill’s ValueVectors
concepts and code form the basis of a new library/collaboration/project.
This means better interoperability, shared responsibility around
maintenance and development and the avoidance of further division of the
ecosystem.



A little background for some: Drill is the first project to create a
powerful, language-agnostic in-memory representation of complex columnar
data. We've learned a lot over the last three years about how to interface
with these structures, manage memory associated with them, adjust their
sizes, expose them in builder patterns, etc. That work is useful for a
number of systems and it would be great if we could share the learning. By
creating a new, well-documented and collaborative library, people could
leverage this functionality in a wider range of applications and systems.



I’ve seen the great success that libraries like Parquet and Calcite have
been able to achieve due to their focus on APIs, extensibility and
reusability and I think we could do the same with the Drill ValueVector
codebase. This would allow higher-speed interchange among many other
systems and would let this become the standard for in-memory columnar
exchange (as opposed to our having to adopt an external standard), which
makes it a great opportunity to both benefit the Drill community and give
back to the broader Apache community.



As such, I’d like to open a discussion about taking this path. I think
there are various ways to do this, but my initial proposal is to create a
new project that goes straight to a provisional TLP. We would then work to
clean up layer responsibilities and extract pieces of the code into this
new project, where we would collaborate with a wider group on a broader
implementation (and a more formal specification).


Given the conversations I have had and the excitement and need for this, I
think we should do this. If the community is supportive, we could probably
see some really cool integrations around things like high-speed Python
machine learning inside Drill operators before the end of the year.



I’ll open a new JIRA and attach it here where we can start a POC &
discussion of how we could extract this code.


Looking forward to feedback!


Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Steven Phillips <st...@dremio.com>.

+1 on merging this soon.

Going forward, I agree it makes sense to break the RPC module into a
stand-alone module that is not specific to Drill. But whether it is better
for it to live in the Drill project or in the new Vector project, I am not
sure.


Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Jacques Nadeau <ja...@dremio.com>.

FYI, the patch also just successfully completed the extended regression
suite.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Jacques Nadeau <ja...@dremio.com>.

Ok guys,

I took the quiet time directly after the release candidate went out to do
the first phase of componentization. You can see my work at [1].

This set of commits has little functional impact. I've also done my best to
avoid package or file renaming, rather keeping things in their same
packages but in different modules (so that other patches are more easily
applied). There are nine commits in the branch. They break down into three
categories: MOVE, REFACTOR & CLEANUP.

I've separated the changes out so that it should be reasonably
straightforward to review. The MOVE patches are constrained primarily to
moving files from one module to another.

DRILL-3987: (MOVE) Extract key vector, field reader, complex/field wr… … 21cbd84
DRILL-3987: (REFACTOR) Common and Vector modules building. … e390db9
DRILL-3987: (REFACTOR) Working TPCH unit tests … 2cc1d30
DRILL-3987: (MOVE) Extract RPC, memory-base and memory-impl as separa… … d5f3211
DRILL-3987: (REFACTOR) Extract BoundsChecking check from AssertionUti… … 83c53d8
DRILL-3987: (CLEANUP) Delete unused files 5d596d5
DRILL-3987: (REFACTOR) Remove any parent Drill dependencies for drill… … 76f578c
DRILL-3987: (MOVE) Move logical expressions and operators out of comm… … f908b8b
DRILL-3987: (CLEANUP) Final cleanups to get complete working build/di… … d09aa3b

The main goal was to extract a number of separate java-exec submodules.
I've also outlined the modularization in a couple slides at [2]. In those
slides you'll see that there are some orange dependencies that will need to
be removed in a second phase of effort. We also need to decide which
portions of the third slide at [2] would be appropriate as a separate
project versus maintained inside of Drill.

Some of the dependencies will need a finer-grained hand to separate. The
biggest remaining task is cleaning up VectorDescriptor, MaterializedField,
SerializedField, SchemaPath and FieldReference so that vector can stop
depending on the new drill-logical module.

My preference would be to merge this straight away as the functional impact
is limited and it would be exceedingly difficult to maintain this patch.
This patch set provides a complete set of changes for modularization and
passes all unit tests. I'm running the extended regression suite now to
confirm no impact on those issues. I don't expect any since the only bugs
I've had to track down thus far are drill-module or pom dependency issues.

Let me know your thoughts.

[1] https://github.com/apache/drill/pull/250
[2] https://docs.google.com/presentation/d/1HD-EzAgNe4EJvoP91ILFLFJdFjT2T5yfM9MEv79BaiM/edit

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Jacques Nadeau <ja...@dremio.com>.

Yes, I've started the umbrella @
https://issues.apache.org/jira/browse/DRILL-3986

And the first sub task: extraction poc @
https://issues.apache.org/jira/browse/DRILL-3987

I posted some existing materials. I'll start looking at how we can extract.
Would love others' thoughts about how we might slice things. I'll post some
initial thoughts on the JIRAs in this regard.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Julian Hyde <jh...@apache.org>.

Jacques, Can you please log the JIRA case you mentioned, and also attach any documentation (e.g. javadoc) you already have.

Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Julian Hyde <jh...@apache.org>.

+100

Thanks for spearheading this, Jacques.

They say memory is the new disk. So, it’s no longer sufficient to use the same on-disk data format if we want our tools to interoperate. The idea of engines interoperating by reading the same in-memory temporary tables, and passing data from one engine to another, is very exciting.

Also exciting is the idea that, by pooling our resources, we can spend less time maintaining all of this tricky code. :)

I know that the Hive and Storm teams have done a lot of work in this area already, and have their own technology, but I will encourage them to be part of this initiative.

Julian



Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Ted Dunning <te...@gmail.com>.

This sounds like a really good idea to me.




Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Julien Le Dem <ju...@dremio.com>.

+1, looking forward to vectorized Parquet Readers/Writers in Drill.
Making VV a standalone standard sounds great to me.




-- 
Julien

Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Parth Chandra <pa...@apache.org>.

+1. Agree with Hanifi that we probably should have done this sooner :).
Jason and I faced this need when trying to get a standalone vectorized
Parquet reader out of the Drill code last year.



On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <hg...@maprtech.com> wrote:

> I was hoping to see this discussion happen sooner :) VVs have helped
> Drill represent and move data around so flexibly that it would not be
> hard to prove their usefulness to the community as a standalone library. I am
> in support of this proposal.
>
>
> -Hanifi
>

Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Posted by Hanifi Gunes <hg...@maprtech.com>.

I was hoping to see this discussion happen sooner :) VVs have helped
Drill represent and move data around so flexibly that it would not be
hard to prove their usefulness to the community as a standalone library. I am
in support of this proposal.


-Hanifi
