Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/06/25 20:45:19 UTC

Arrow and oamap [was: Re: Gandiva Initiative]

hi Martin,

These projects are very different. Many analytic databases feature
code generation (recently a lot of these use LLVM -- see Hyper, Apache
Impala, and others) on the hot paths for function evaluation (e.g. for
evaluating the expressions in the SELECT part or the WHERE part) --
the reason people are excited about Gandiva is that it makes this type
of functionality available as a library running atop an open standard
memory format (Arrow columnar), so it can be used in any programming
language assuming suitable bindings can be developed. This is very
much in line with our vision for creating a "deconstructed database"
(see a talk that Julien gave on this topic:
https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database)

I have not looked a great deal at oamap, but it does not use the Arrow
columnar format AFAIK. It is written in Python and presumes some other
technologies in use (like the ROOT format).

So to summarize:

Gandiva
* Compiles analytical expressions to execute against Arrow columnar format
* Is written in C++ and can be embedded in other systems (Dremio is
using it from Java)

oamap
* Does not use the Arrow columnar format
* Presumes other technologies in use (ROOT)
* Is written in Python, and would be challenging to use as an embedded
system component

I'm certain these projects can learn from each other -- I have spoken
with Jim (one of the developers of oamap) in the past, so welcome
further discussion here on the mailing list.

Thanks,
Wes
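
To make the "compile an expression, then run it against columnar data" idea above concrete, here is a minimal sketch in plain Python. It is not Gandiva's API: the names Col, Lit, BinOp, to_source and compile_projection are hypothetical, and Python's built-in compile/exec stands in for LLVM code generation. The point is only the shape of the technique: build an expression tree, generate code for it once, and reuse the generated function for every batch of columns.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str          # reference to a named column


@dataclass(frozen=True)
class Lit:
    value: float       # numeric literal


@dataclass(frozen=True)
class BinOp:
    op: str            # one of "+", "-", "*", "<="
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]


def to_source(expr):
    """Render an expression tree as Python source evaluated at row index i."""
    if isinstance(expr, Col):
        return f"cols[{expr.name!r}][i]"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if expr.op not in {"+", "-", "*", "<="}:
        raise ValueError(f"unsupported operator: {expr.op}")
    return f"({to_source(expr.left)} {expr.op} {to_source(expr.right)})"


def compile_projection(expr):
    """Generate and compile a kernel once; call it for each columnar batch."""
    source = (
        "def kernel(cols, n):\n"
        f"    return [{to_source(expr)} for i in range(n)]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


# price * (1 - discount), evaluated over a two-row columnar batch
expr = BinOp("*", Col("price"), BinOp("-", Lit(1.0), Col("discount")))
kernel = compile_projection(expr)
batch = {"price": [100.0, 200.0], "discount": [0.5, 0.25]}
result = kernel(batch, 2)
```

A real engine would emit machine code and operate on Arrow buffers (including validity bitmaps) rather than Python lists; the sketch only shows why the compiled kernel is independent of the host language that built the expression tree.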

On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
<ma...@utoronto.ca> wrote:
> I am a little surprised by the very positive reception to Gandiva (which doubtless is very useful - I know very little about it) versus when I brought up the prospect of using oamap ( https://github.com/diana-hep/oamap ) on this mailing list.
>
> oamap uses numba to compile *python* functions at run-time and can walk complex nested schema down to leaf nodes in native python syntax (for-loops and attribute/item lookup) but at full machine speed, and without materialising any objects along the way. It was written for the ROOT format, but has implementations for simple types in parquet and arrow, which each do the nested lists and dict things similarly but differently.
>
> Would someone care to explain the silence over oamap?
>
>> On 25 Jun 2018, at 02:06, Praveen Kumar <pr...@dremio.com> wrote:
>>
>> Hi Everyone,
>>
>> I am Praveen, another engineer working on Gandiva. The interest and speed of engagement around this is great!! Excited to engage with you folks on this.
>>
>> Thx.
>>
>> On 2018/06/22 18:09:42, Julian Hyde <j....@apache.org> wrote:
>>> This is exciting. We have wanted to build an Arrow adapter in Calcite for some time and have a prototype (see https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that we can use Gandiva. I know that Gandiva has Java bindings, but will these allow queries to be compiled and executed from a pure Java process?
>>>
>>> Can you describe Gandiva’s governance model? Without an open governance model, companies that compete with Dremio may be wary about contributing.
>>>
>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also concerned with efficient use of the bus, and also uses LLVM, but it has a different memory format and places much emphasis on lock-free data structures.
>>>
>>> I just attended SIGMOD and there were interesting industry papers from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks MemSQL uses to achieve SIMD parallelism on queries such as “select k4, sum(x) from t group by k4” (where k4 has 4 values).
>>>
>>> I missed part of the RAPID talk, but I got the impression that they are using disk-based algorithms (e.g. hybrid hash join) to handle data spread between fast and slow memory.
>>>
>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would be a good target for Gandiva also. It is a table scan with a range filter (returning 98% of rows), a low-cardinality aggregate (grouping by two fields with 3 values each), and several aggregate functions, the arguments of which contain common sub-expressions.
>>>
>>> SELECT
>>>   l_returnflag,
>>>   l_linestatus,
>>>   sum(l_quantity),
>>>   sum(l_extendedprice),
>>>   sum(l_extendedprice * (1 - l_discount)),
>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>>>   avg(l_quantity),
>>>   avg(l_extendedprice),
>>>   avg(l_discount),
>>>   count(*)
>>> FROM lineitem
>>> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>>> GROUP BY
>>>   l_returnflag,
>>>   l_linestatus
>>> ORDER BY
>>>   l_returnflag,
>>>   l_linestatus;
>>>
>>> Julian
>>>
>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>
>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>>>
>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>>>
>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>>>
>>>> On Jun 22, 2018, at 7:22 AM, ravindrap@gmail.com wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do believe that the combination of arrow and llvm for efficient expression evaluation is powerful, and has a broad range of use-cases. We've just started and hope to finesse and add a lot of functionality over the next few months.
>>>>
>>>> Welcome your feedback and participation in gandiva !!
>>>>
>>>> thanks & regards,
>>>> ravindra.
>>>>
>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>>>>> Hey Guys,
>>>>>
>>>>> Dremio just open sourced a new framework for processing data in Arrow data
>>>>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>>>>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>>>>> Arrow Java libraries. I expect the developers who have been working on this
>>>>> will introduce themselves soon. To read more about it, take a look at our
>>>>> Ravindra's blog post (he's the lead developer driving this work): [2].
>>>>> Hopefully people will find this interesting/useful.
>>>>>
>>>>> Let us know what you all think!
>>>>>
>>>>> thanks,
>>>>> Jacques
>>>>>
>>>>>
>>>>> [1] https://github.com/dremio/gandiva
>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>>>>
>>>
>
> —
> Martin Durant
> martin.durant@utoronto.ca
>
>
>
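
The TPC-H query 1 quoted above has aggregate arguments that share sub-expressions: l_extendedprice * (1 - l_discount) appears in two of the sums. As a rough sketch of what an expression compiler can exploit, the plain-Python version below computes that shared term once per row and reuses it. The four-row lineitem batch is made up for illustration, and the function name q1 and the output field names are assumptions, not part of any benchmark kit or of Gandiva.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical lineitem batch, stored column by column.
lineitem = {
    "l_returnflag":    ["A", "A", "N", "N"],
    "l_linestatus":    ["F", "F", "O", "O"],
    "l_quantity":      [10.0, 20.0, 5.0, 15.0],
    "l_extendedprice": [100.0, 200.0, 50.0, 150.0],
    "l_discount":      [0.5, 0.25, 0.0, 0.5],
    "l_tax":           [0.0, 1.0, 0.5, 0.0],
    "l_shipdate":      [date(1998, 1, 1), date(1998, 6, 1),
                        date(1998, 8, 1), date(1998, 11, 15)],
}


def q1(cols):
    """Filter, group and aggregate in one pass over the columns."""
    cutoff = date(1998, 12, 1) - timedelta(days=90)
    # Per group: sum_qty, sum_base, sum_disc_price, sum_charge, sum_disc, count
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0, 0.0, 0])
    for i in range(len(cols["l_shipdate"])):
        if cols["l_shipdate"][i] > cutoff:
            continue  # WHERE l_shipdate <= cutoff
        price = cols["l_extendedprice"][i]
        discount = cols["l_discount"][i]
        disc_price = price * (1 - discount)            # shared sub-expression
        charge = disc_price * (1 + cols["l_tax"][i])   # reuses disc_price
        acc = sums[(cols["l_returnflag"][i], cols["l_linestatus"][i])]
        acc[0] += cols["l_quantity"][i]
        acc[1] += price
        acc[2] += disc_price
        acc[3] += charge
        acc[4] += discount
        acc[5] += 1
    out = {}
    for key in sorted(sums):  # ORDER BY l_returnflag, l_linestatus
        qty, base, disc, charge, discount, n = sums[key]
        out[key] = {
            "sum_qty": qty, "sum_base_price": base,
            "sum_disc_price": disc, "sum_charge": charge,
            "avg_qty": qty / n, "avg_price": base / n,
            "avg_disc": discount / n, "count_order": n,
        }
    return out


result = q1(lineitem)
```

With only a handful of distinct groups, the accumulators stay small enough to sit in cache or registers, which is part of why this query is a popular target for vectorized and code-generated engines.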

Re: Arrow and oamap [was: Re: Gandiva Initiative]

Posted by Wes McKinney <we...@gmail.com>.

Thanks for pointing this out, I was not aware of it.

I looked through the mailing list logs -- you brought up oamap on a
thread that was about something else and so a lot of people probably
did not see it. It would be great to hear directly from the developers
of oamap here to see if there are collaboration opportunities and how
we can help each other.

Thanks
Wes

On Mon, Jun 25, 2018 at 5:00 PM, Martin Durant
<ma...@utoronto.ca> wrote:
> Let me just quickly correct a couple of points, for clarity.
> The following module
> https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py
> is a proof-of-concept of oamap running directly on arrow memory - this is the original reason I raised the topic here.
>
> There are also POCs showing operation on numpy records and parquet files. oamap was written with ROOT in mind, but is not necessarily tied to it; it just so happens that ROOT data tends to have just the kind of deeply nested structures that tend to perform terribly in a pandas apply situation or other python object-based processing. oamap depends on numba/llvmlite.
>
> So indeed maybe this all belongs as a conversation in pyarrow rather than arrow, but insomuch as it enables - or may in the future enable - machine-speed computation on in-memory nested arrow data, I think oamap should be on everyone’s radar as an interesting and useful project.
>
>
> —
> Martin Durant
> martin.durant@utoronto.ca
>
>
>
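
The phrase "walk complex nested schema down to leaf nodes ... without materialising any objects" can be illustrated with offset arrays, the same idea Arrow uses for variable-size lists. The sketch below is not oamap code; the array names (event_offsets, track_offsets, hit_energy) and the events/tracks/hits schema are invented for the example. Nested lists are stored as flat arrays plus offsets, and ordinary for-loops index straight into the leaf array.

```python
# Two levels of nesting stored flat: events -> tracks -> hits.
# offsets[i] .. offsets[i + 1] delimits the children of item i.
event_offsets = [0, 2, 2, 3]      # event 0: tracks 0-1, event 1: none, event 2: track 2
track_offsets = [0, 3, 4, 6]      # track 0: hits 0-2, track 1: hit 3, track 2: hits 4-5
hit_energy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # leaf values


def energy_per_event(event_offsets, track_offsets, hit_energy):
    """Sum leaf values per event using only integer indexing."""
    totals = []
    for e in range(len(event_offsets) - 1):
        total = 0.0
        for t in range(event_offsets[e], event_offsets[e + 1]):
            for h in range(track_offsets[t], track_offsets[t + 1]):
                total += hit_energy[h]   # no per-track or per-hit objects built
        totals.append(total)
    return totals


totals = energy_per_event(event_offsets, track_offsets, hit_energy)
```

Because the loop body touches only integers and floats in flat arrays, it is the kind of function a JIT compiler such as numba can turn into a tight native loop when the inputs are NumPy arrays.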

Re: Arrow and oamap [was: Re: Gandiva Initiative]

Posted by Jim Pivarski <jp...@gmail.com>.

Thanks, I'll let you know if I run into issues with Gandiva, and I'll pass
that on to my contact with ALICE.

I haven't caught up to the earlier part of this conversation yet, but the
main thing that was lacking for my use case was handling arbitrary data
structures and arbitrary code structures. I'll look into it, but I don't
know yet if Gandiva will satisfy this need (earlier discussion referred to
SQL SELECT and WHERE, but we need to go beyond that). Don't worry about
duplication of effort— I'm developing only those pieces that I haven't been
able to find a ready-made solution for, and I'd be happy to use one if
it'll work for us.

Cheers,
Jim



On Mon, Jun 25, 2018, 10:03 PM Wes McKinney <we...@gmail.com> wrote:

> Thanks Jim, that's helpful. In the long run, we'd like to make sure
> that the Arrow libraries can serve as a robust dependency for
> computational systems like OAMap. It doesn't strike me that OAMap is a
> library that could be used within the Arrow project, though there are
> some things which could be implemented as part of a function kernel
> library or function code-generator. The streaming data / IPC machinery
> we have developed could be useful for working with large on-disk
> datasets or efficient data movement between nodes in a cluster.
>
> Let us know as you run into issues or missing features so we can
> incorporate them into our development roadmap.
>
> best
> Wes
>
> On Mon, Jun 25, 2018 at 5:25 PM, Jim Pivarski <jp...@gmail.com> wrote:
> > What Martin said about OAMap and ROOT is true: there's no dependence, ROOT
> > and Arrow are both backends.
> >
> > What Wes said about embeddability is also right: OAMap is pure Python+Numpy
> > and would be hard to use within a C++ framework (or Java). This has already
> > been an issue with a potential user in the ALICE experiment. They're
> > considering Arrow in a C++ framework.
> >
> > I've been quiet for a few months because I've been trying to work OAMap
> > into Dask and having trouble with the fact that OAMap data are "loose": the
> > schema is a separate thing from the data and different columns of the data
> > can be in different files, in different file formats, on different machines.
> > I've found another reformulation of the problem in a bottom-up way, in
> > which the data are in an extended set of array types: like Numpy arrays,
> > but they can be chunked, jagged, AOS-SOA split, generated on demand, etc.
> > The schema is implicit in the nesting of these objects, rather than being
> > explicit and separate from the objects, and they reproduce all of the
> > desirable effects of OAMap.
> >
> > So, my project being a research project, I'm following this promising line
> > of attack. But it probably cuts this discussion short if it was about Gandiva
> > vs OAMap as duplicating effort. Martin, I was planning on sending you a
> > note about this when I had a working example, particularly if I could
> > convince Dask that these arrays can be Dask lazy arrays. (I started working
> > on this alternate approach exactly two weeks ago; it's very recent.)
> >
> > Meanwhile, I'm also going to look into Gandiva and possibly recommend it to
> > the ALICE experiment.
> >
> > Cheers,
> > Jim
> >
> >
> >
> >
>
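
Two of the array variations mentioned in the quoted message, "chunked" and "AOS-SOA split", can be sketched in a few lines. This is only an illustration of the general ideas, not the design being described there: the class ChunkedColumn and the function aos_to_soa are hypothetical names, and plain lists stand in for NumPy arrays.

```python
from bisect import bisect_right


class ChunkedColumn:
    """One logical array whose values live in several separate chunks."""

    def __init__(self, chunks):
        self.chunks = [c for c in chunks if len(c) > 0]   # drop empty chunks
        self.starts = []                                  # global index of each chunk's first item
        total = 0
        for chunk in self.chunks:
            self.starts.append(total)
            total += len(chunk)
        self.length = total

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        k = bisect_right(self.starts, i) - 1   # which chunk holds index i
        return self.chunks[k][i - self.starts[k]]


def aos_to_soa(records, fields):
    """Split an array of structs into a struct of arrays, one list per field."""
    return {f: [r[f] for r in records] for f in fields}


records = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
soa = aos_to_soa(records, ["x", "y"])

column = ChunkedColumn([[1, 2], [], [3, 4, 5]])
third = column[2]
```

In this arrangement the schema is implied by how the containers nest (a dict of columns, each possibly chunked) rather than stored as a separate description, which matches the "bottom-up" framing in the message.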

Re: Arrow and oamap [was: Re: Gandiva Initiative]

Posted by Wes McKinney <we...@gmail.com>.

Thanks Jim, that's helpful. In the long run, we'd like to make sure
that the Arrow libraries can serve as a robust dependency for
computational systems like OAMap. It doesn't strike me that OAMap is a
library that could be used within the Arrow project, though there are
some things which could be implemented as part of a function kernel
library or function code-generator. The streaming data / IPC machinery
we have developed could be useful for working with large on-disk
datasets or efficient data movement between nodes in a cluster.

Let us know as you run into issues or missing features so we can
incorporate them into our development roadmap.

best
Wes

On Mon, Jun 25, 2018 at 5:25 PM, Jim Pivarski <jp...@gmail.com> wrote:
> What Martin said about OAMap and ROOT is true: there's no dependence, ROOT
> and Arrow are both backends.
>
> What Wes said about embeddability is also right: OAMap is pure Python+Numpy
> and would be hard to use within a C++ framework (or Java). This has already
> been an issue with a potential user in the ALICE experiment. They're
> considering Arrow in a C++ framework.
>
> I've been quiet for a few months because I've been trying to work OAMap
> into Dask and having trouble with the fact that OAMap data are "loose": the
> schema is a separate thing from the data and different columns of the data
> can be in different files, in different file formats, on different machines.
> I've found another reformulation of the problem in a bottom-up way, in
> which the data are in an extended set of array types: like Numpy arrays,
> but they can be chunked, jagged, AOS-SOA split, generated on demand, etc.
> The schema is implicit in the nesting of these objects, rather than being
> explicit and separate from the objects, and they reproduce all of the
> desirable effects of OAMap.
>
> So, my project being a research project, I'm following this promising line
> of attack. But it probably cuts this discussion short if it was about Gandiva
> vs OAMap as duplicating effort. Martin, I was planning on sending you a
> note about this when I had a working example, particularly if I could
> convince Dask that these arrays can be Dask lazy arrays. (I started working
> on this alternate approach exactly two weeks ago; it's very recent.)
>
> Meanwhile, I'm also going to look into Gandiva and possibly recommend it to
> the ALICE experiment.
>
> Cheers,
> Jim
>
>
>
>
> On Mon, Jun 25, 2018, 4:00 PM Martin Durant <ma...@utoronto.ca>
> wrote:
>
>> Let me just quickly correct a couple of points, for clarity.
>> The following module
>> https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py
>> is a proof-of-concept of oamap running directly on arrow memory - this is
>> the original reason I raised the topic here.
>>
>> There are also POCs showing operation on numpy records and parquet files.
>> oamap was written with ROOT in mind,
>> but is not necessarily tied to it; it just so happens that ROOT data tends
>> to have just the kind of deeply nested structures
>> that tend to perform terribly in a pandas apply situation or other python
>> object-based processing. oamap depends on
>> numba/llvm-lite
>>
>> So indeed maybe this all belongs as a conversation in pyarrow rather than
>> arrow, but insomuch as it enables - or may in
>> the future enable - machine-speed computation on in-memory nested arrow
>> data, I think oamap should be on
>> everyone’s radar as an interesting and useful project.
>>
>> > On 25 Jun 2018, at 16:45, Wes McKinney <we...@gmail.com> wrote:
>> >
>> > hi Martin,
>> >
>> > These projects are very different. Many analytic databases feature
>> > code generation (recently a lot of these use LLVM -- see Hyper, Apache
>> > Impala, and others) on the hot paths for function evaluation (e.g. for
>> > evaluating the expressions in the SELECT part or the WHERE part) --
>> > the reason people are excited about Gandiva is that it makes this type
>> > of functionality available as a library running atop an open standard
>> > memory format (Arrow columnar), so can be used in any programming
>> > language assuming suitable bindings can be developed. This is very
>> > much in line with our vision for creating a "deconstructed database"
>> > (see a talk that Julien gave on this topic:
>> >
>> https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
>> )
>> >
>> > I have not looked a great deal at oamap, but it does not use the Arrow
>> > columnar format AFAIK. It is written in Python and presumes some other
>> > technologies in use (like the ROOT format).
>> >
>> > So to summarize:
>> >
>> > Gandiva
>> > * Compiles analytical expressions to execute against Arrow columnar
>> format
>> > * Is written in C++ and can be embedded in other systems (Dremio is
>> > using it from Java)
>> >
>> > oamap
>> > * Does not use the Arrow columnar format
>> > * Presumes other technologies in use (ROOT)
>> > * Is written in Python, and would be challenging to use an embedded
>> > system component
>> >
>> > I'm certain these projects can learn from each other -- I have spoken
>> > with Jim (one of the developers of oamap) in the past, so welcome
>> > further discussion here on the mailing list.
>> >
>> > Thanks,
>> > Wes
>> >
>> > On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
>> > <ma...@utoronto.ca> wrote:
>> >> I am a little surprised by the very positive reception to Gandiva
>> (which doubtless is very useful - I know very little about it) versus when
>> I brought up the prospect of using oamap (
>> https://github.com/diana-hep/oamap ) on this mailing list.
>> >>
>> >> oamap uses numba to compile *python* functions at run-time and can walk
>> complex nested schema down to leaf nodes in native python syntax (for-loops
>> and attribute/item lookup) but at full machine speed, and without
>> materialising any objects along the way. It was written for the ROOT
>> format, but has implementations for simple types in parquet and arrow,
>> which each do the nested lists and dict things similarly but differently.
>> >>
>> >> Would someone care to explain the silence over oamap?
>> >>
>> >>> On 25 Jun 2018, at 02:06, Praveen Kumar <pr...@dremio.com> wrote:
>> >>>
>> >>> Hi Everyone,
>> >>>
>> >>> I am Praveen, another engineer working on Gandiva. The interest and
>> speed of engagement around this is great !!Excited to engage with you folks
>> on this.
>> >>>
>> >>> Thx.
>> >>>
>> >>> On 2018/06/22 18:09:42, Julian Hyde <j....@apache.org> wrote:
>> >>>> This is exciting. We have wanted to build an Arrow adapter in Calcite
>> for some time and have a prototype (see
>> https://issues.apache.org/jira/browse/CALCITE-2173 <
>> https://issues.apache.org/jira/browse/CALCITE-2173>) but I hope that we
>> can use Gandiva. I know that Gandiva has Java bindings, but will these
>> allow queries to be compiled and executed from a pure Java process?>
>> >>>>
>> >>>> Can you describe Gandiva’s governance model? Without an open
>> governance model, companies that compete with Dremio may be wary about
>> contributing.>
>> >>>>
>> >>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also
>> concerned with efficient use to the bus, and also uses LLVM, but it has a
>> different memory format and places much emphasis on lock-free data
>> structures.>
>> >>>>
>> >>>> I just attended SIGMOD and there were interesting industry papers
>> from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the
>> tricks MemSQL uses to achieve SIMD parallelism on queries such as “select
>> k4, sum(x) from t group by k4” (where k4 has 4 values).>
>> >>>>
>> >>>> I missed part of the RAPID talk, but I got the impression that they
>> are using disk-based algorithms (e.g. hybrid hash join) to handle data
>> spread between fast and slow memory.>
>> >>>>
>> >>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this
>> would be good target for Gandiva also. It is a table scan with a range
>> filter (returning 98% of rows), a low-cardinality aggregate (grouping by
>> two fields with 3 values each), and several aggregate functions, the
>> arguments of which contain common sub-expressions.>
>> >>>>
>> >>>> SELECT>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus,>
>> >>>>   sum(l_quantity),>
>> >>>>   sum(l_extendedprice),>
>> >>>>   sum(l_extendedprice * (1 - l_discount)),>
>> >>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),>
>> >>>>   avg(l_quantity),>
>> >>>>   avg(l_extendedprice),>
>> >>>>   avg(l_discount),>
>> >>>>   count(*)>
>> >>>> FROM lineitem>
>> >>>> WHERE l_shipdate <= date '1998-12-01' - interval '90’ day>
>> >>>> GROUP BY>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus>
>> >>>> ORDER BY>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus;>
>> >>>>
>> >>>> Julian>
>> >>>>
>> >>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf <
>> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf>>
>> >>>>
>> >>>> [2]
>> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>> <
>> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>> >>
>> >>>>
>> >>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658 <
>> https://dl.acm.org/citation.cfm?id=3183713.3190658>>
>> >>>>
>> >>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655 <
>> https://dl.acm.org/citation.cfm?id=3183713.3190655>>
>> >>>>
>> >>>>> On Jun 22, 2018, at 7:22 AM, ravindrap@gmail.com wrote:>
>> >>>>>>
>> >>>>> Hi everyone,>
>> >>>>>>
>> >>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do
>> believe that the combination of arrow and llvm for efficient expression
>> evaluation is powerful, and has a broad range of use-cases. We've just
>> started and hope to finesse and add a lot of functionality over the next
>> few months.>
>> >>>>>>
>> >>>>> Welcome your feedback and participation in gandiva !!>
>> >>>>>>
>> >>>>> thanks & regards,>
>> >>>>> ravindra.>
>> >>>>>>
>> >>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote: >
>> >>>>>> Hey Guys,>
>> >>>>>>>
>> >>>>>> Dremio just open sourced a new framework for processing data in
>> Arrow data>
>> >>>>>> structures [1], built on top of the Apache Arrow C++ APIs and
>> leveraging>
>> >>>>>> LLVM (Apache licensed). It also includes Java APIs that leverage
>> the Apache>
>> >>>>>> Arrow Java libraries. I expect the developers who have been working
>> on this>
>> >>>>>> will introduce themselves soon. To read more about it, take a look
>> at our>
>> >>>>>> Ravindra's blog post (he's the lead developer driving this work):
>> [2].>
>> >>>>>> Hopefully people will find this interesting/useful.>
>> >>>>>>>
>> >>>>>> Let us know what you all think!>
>> >>>>>>>
>> >>>>>> thanks,>
>> >>>>>> Jacques>
>> >>>>>>>
>> >>>>>>>
>> >>>>>> [1] https://github.com/dremio/gandiva>
>> >>>>>> [2]
>> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/>
>> >>>>>>>
>> >>>>
>> >>
>> >> —
>> >> Martin Durant
>> >> martin.durant@utoronto.ca
>> >>
>> >>
>> >>
>>
>> —
>> Martin Durant
>> martin.durant@utoronto.ca
>>
>>
>>
>>

Re: Arrow and oamap [was: Re: Gandiva Initiative]

Posted by Jim Pivarski <jp...@gmail.com>.

What Martin said about OAMap and ROOT is true: there's no dependence; ROOT
and Arrow are both backends.

What Wes said about embeddability is also right: OAMap is pure Python+NumPy
and would be hard to use within a C++ (or Java) framework. This has already
been an issue with a potential user in the ALICE experiment. They're
considering Arrow in a C++ framework.

I've been quiet for a few months because I've been trying to work OAMap
into Dask and having trouble with the fact that OAMap data are "loose": the
schema is a separate thing from the data, and different columns of the data
can be in different files, in different file formats, on different machines.
I've found a reformulation of the problem in a bottom-up way, in
which the data are in an extended set of array types: like NumPy arrays,
but they can be chunked, jagged, AOS-SOA split, generated on demand, etc.
The schema is implicit in the nesting of these objects, rather than being
explicit and separate from the objects, and they reproduce all of the
desirable effects of OAMap.
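
To make the "jagged" and "chunked" array types above a little more concrete, here is a minimal sketch under stated assumptions. The names JaggedArray and chunked are hypothetical and are not part of OAMap or any other library, and plain Python lists stand in for the NumPy arrays so that the sketch has no dependencies. It only illustrates how variable-length lists can live in two flat arrays, with the structure implied by how the arrays are wrapped rather than by a separate schema.

```python
# Illustrative only: JaggedArray and chunked are hypothetical names.
class JaggedArray:
    """Variable-length lists stored as two flat arrays instead of nested objects."""

    def __init__(self, offsets, content):
        # offsets has one more entry than there are lists; list i occupies
        # content[offsets[i]:offsets[i + 1]].
        if len(offsets) == 0 or offsets[-1] > len(content):
            raise ValueError("offsets must be non-empty and stay inside content")
        self.offsets = offsets
        self.content = content

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.content[self.offsets[i]:self.offsets[i + 1]]


def chunked(chunks):
    """Yield lists one at a time from several JaggedArray chunks."""
    for chunk in chunks:
        for i in range(len(chunk)):
            yield chunk[i]


# Three events holding 2, 0 and 3 values each.
first = JaggedArray(offsets=[0, 2, 2, 5], content=[1.5, 2.5, 3.0, 4.0, 5.0])
# A second chunk, which could come from a different file or machine.
second = JaggedArray(offsets=[0, 1], content=[9.0])

lists = list(chunked([first, second]))
```

Here first[1] is an empty list, and iterating over both chunks gives four lists in total, without the caller needing to know where each chunk came from.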

So, my project being a research project, I'm following this promising line
of attack. But it probably cuts this discussion short if it was about Gandiva
vs OAMap as duplicating effort. Martin, I was planning on sending you a
note about this when I had a working example, particularly if I could
convince Dask that these arrays can be Dask lazy arrays. (I started working
on this alternate approach exactly two weeks ago, so it's very recent.)

Meanwhile, I'm also going to look into Gandiva and possibly recommend it to
the ALICE experiment.

Cheers,
Jim





Re: Arrow and oamap [was: Re: Gandiva Initiative]

Posted by Martin Durant <ma...@utoronto.ca>.

Let me just quickly correct a couple of points, for clarity.
The following module
https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py
is a proof-of-concept of oamap running directly on Arrow memory; this is the original reason I raised the topic here.

There are also POCs showing operation on NumPy records and Parquet files. oamap was written with ROOT in mind,
but is not necessarily tied to it; it just so happens that ROOT data tends to have just the kind of deeply nested structures
that perform terribly in a pandas apply or other Python object-based processing. oamap depends on
numba/llvmlite.
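
As a rough illustration of the difference between object-based and column-based processing of nested data, here is a small sketch. It is not oamap's API: the function and variable names (max_pt_objects, max_pt_columns, muon_offsets, muon_pt) are made up for this example, and plain Python lists stand in for the arrays. The second function only does integer indexing into flat columns, which is the general shape of loop that a compiler such as numba can work with when the columns are NumPy arrays.

```python
# Object-based layout: one Python dict per event, one dict per muon.
events = [
    {"muons": [{"pt": 10.0}, {"pt": 32.5}]},
    {"muons": []},
    {"muons": [{"pt": 7.5}]},
]

# The same data as flat columns: muon_pt holds every pt value back to back,
# and muon_offsets[i]:muon_offsets[i + 1] is the range belonging to event i.
muon_offsets = [0, 2, 2, 3]
muon_pt = [10.0, 32.5, 7.5]


def max_pt_objects(events):
    """Largest muon pt per event (None if no muons), visiting one object per value."""
    return [max((m["pt"] for m in e["muons"]), default=None) for e in events]


def max_pt_columns(offsets, pt):
    """Same result using only integer indexing into the flat columns."""
    out = []
    for i in range(len(offsets) - 1):
        best = None
        for j in range(offsets[i], offsets[i + 1]):
            if best is None or pt[j] > best:
                best = pt[j]
        out.append(best)
    return out


from_objects = max_pt_objects(events)
from_columns = max_pt_columns(muon_offsets, muon_pt)
```

Both calls give the same per-event answer; the column version never builds a per-muon object along the way.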

So indeed maybe this all belongs as a conversation in pyarrow rather than Arrow, but insomuch as it enables (or may in
the future enable) machine-speed computation on in-memory nested Arrow data, I think oamap should be on
everyone’s radar as an interesting and useful project.

> On 25 Jun 2018, at 16:45, Wes McKinney <we...@gmail.com> wrote:
> 
> hi Martin,
> 
> These projects are very different. Many analytic databases feature
> code generation (recently a lot of these use LLVM -- see Hyper, Apache
> Impala, and others) on the hot paths for function evaluation (e.g. for
> evaluating the expressions in the SELECT part or the WHERE part) --
> the reason people are excited about Gandiva is that it makes this type
> of functionality available as a library running atop an open standard
> memory format (Arrow columnar), so can be used in any programming
> language assuming suitable bindings can be developed. This is very
> much in line with our vision for creating a "deconstructed database"
> (see a talk that Julien gave on this topic:
> https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database)
> 
> I have not looked a great deal at oamap, but it does not use the Arrow
> columnar format AFAIK. It is written in Python and presumes some other
> technologies in use (like the ROOT format).
> 
> So to summarize:
> 
> Gandiva
> * Compiles analytical expressions to execute against Arrow columnar format
> * Is written in C++ and can be embedded in other systems (Dremio is
> using it from Java)
> 
> oamap
> * Does not use the Arrow columnar format
> * Presumes other technologies in use (ROOT)
> * Is written in Python, and would be challenging to use as an embedded
> system component
> 
> I'm certain these projects can learn from each other -- I have spoken
> with Jim (one of the developers of oamap) in the past, so welcome
> further discussion here on the mailing list.
> 
> Thanks,
> Wes
> 
> On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
> <ma...@utoronto.ca> wrote:
>> I am a little surprised by the very positive reception to Gandiva (which doubtless is very useful - I know very little about it) versus when I brought up the prospect of using oamap ( https://github.com/diana-hep/oamap ) on this mailing list.
>> 
>> oamap uses numba to compile *python* functions at run-time and can walk complex nested schema down to leaf nodes in native python syntax (for-loops and attribute/item lookup) but at full machine speed, and without materialising any objects along the way. It was written for the ROOT format, but has implementations for simple types in parquet and arrow, which each do the nested lists and dict things similarly but differently.
>> 
>> Would someone care to explain the silence over oamap?
>> 
>>> On 25 Jun 2018, at 02:06, Praveen Kumar <pr...@dremio.com> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I am Praveen, another engineer working on Gandiva. The interest and speed of engagement around this is great!! Excited to engage with you folks on this.
>>> 
>>> Thx.
>>> 
>>> On 2018/06/22 18:09:42, Julian Hyde <j....@apache.org> wrote:
>>>> This is exciting. We have wanted to build an Arrow adapter in Calcite for some time and have a prototype (see https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that we can use Gandiva. I know that Gandiva has Java bindings, but will these allow queries to be compiled and executed from a pure Java process?
>>>> 
>>>> Can you describe Gandiva’s governance model? Without an open governance model, companies that compete with Dremio may be wary about contributing.
>>>> 
>>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also concerned with efficient use of the bus, and also uses LLVM, but it has a different memory format and places much emphasis on lock-free data structures.
>>>> 
>>>> I just attended SIGMOD and there were interesting industry papers from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks MemSQL uses to achieve SIMD parallelism on queries such as “select k4, sum(x) from t group by k4” (where k4 has 4 values).
>>>> 
>>>> I missed part of the RAPID talk, but I got the impression that they are using disk-based algorithms (e.g. hybrid hash join) to handle data spread between fast and slow memory.
>>>> 
>>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would be a good target for Gandiva also. It is a table scan with a range filter (returning 98% of rows), a low-cardinality aggregate (grouping by two fields with 3 values each), and several aggregate functions, the arguments of which contain common sub-expressions.
>>>> 
>>>> SELECT
>>>>   l_returnflag,
>>>>   l_linestatus,
>>>>   sum(l_quantity),
>>>>   sum(l_extendedprice),
>>>>   sum(l_extendedprice * (1 - l_discount)),
>>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>>>>   avg(l_quantity),
>>>>   avg(l_extendedprice),
>>>>   avg(l_discount),
>>>>   count(*)
>>>> FROM lineitem
>>>> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>>>> GROUP BY
>>>>   l_returnflag,
>>>>   l_linestatus
>>>> ORDER BY
>>>>   l_returnflag,
>>>>   l_linestatus;
>>>> 
>>>> Julian
>>>> 
>>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>> 
>>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>>>> 
>>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>>>> 
>>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>>>> 
>>>>> On Jun 22, 2018, at 7:22 AM, ravindrap@gmail.com wrote:
>>>>> 
>>>>> Hi everyone,
>>>>> 
>>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do believe that the combination of arrow and llvm for efficient expression evaluation is powerful, and has a broad range of use-cases. We've just started and hope to finesse and add a lot of functionality over the next few months.
>>>>> 
>>>>> Welcome your feedback and participation in gandiva !!
>>>>> 
>>>>> thanks & regards,
>>>>> ravindra.
>>>>> 
>>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>> Hey Guys,
>>>>>> 
>>>>>> Dremio just open sourced a new framework for processing data in Arrow data
>>>>>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>>>>>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>>>>>> Arrow Java libraries. I expect the developers who have been working on this
>>>>>> will introduce themselves soon. To read more about it, take a look at our
>>>>>> Ravindra's blog post (he's the lead developer driving this work): [2].
>>>>>> Hopefully people will find this interesting/useful.
>>>>>> 
>>>>>> Let us know what you all think!
>>>>>> 
>>>>>> thanks,
>>>>>> Jacques
>>>>>> 
>>>>>> 
>>>>>> [1] https://github.com/dremio/gandiva
>>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>>>>> 
>>>> 
>> 
>> —
>> Martin Durant
>> martin.durant@utoronto.ca
>> 
>> 
>> 

—
Martin Durant
martin.durant@utoronto.ca