Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/10/03 04:56:40 UTC

Re: [DISCUSS][Java] Design of the algorithm module

Hi Liya Fan,
Thanks again for writing this up.  I think it provides a road-map for
intended features.  I commented on the document but I wanted to raise a few
high-level concerns here as well to get more feedback from the community.

1.  It isn't clear to me who the users of this will be.  My perception
is that in the Java ecosystem there aren't use-cases for the algorithms
outside of specific compute engines.  I'm not super involved in open-source
Java these days, so I would love to hear others' opinions. For instance, I'm
not sure if Dremio would switch to using these algorithms instead of the
ones they've already open-sourced [1], and Apache Spark, I believe, is only
using Arrow for interfacing with Python (they similarly have their own
compute pipeline).  I think you mentioned in the past that these are being
used internally on an engine that your company is working on, but if that
is the only consumer, it makes me wonder if the algorithm development might
be better served as part of that engine.

2.  If we do move forward with this, we also need a plan for how to
optimize the algorithms to avoid virtual calls.  There are two high-level
approaches: template-based and (byte)code generation based.  Neither is
applicable in all situations, so it would be good to come to a consensus on
when (and when not) to use each.
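To make the trade-off concrete, here is a minimal, hypothetical Java sketch
(the names are illustrative, not Arrow APIs) contrasting an interface-based
hot loop, whose call site can become megamorphic once several implementations
are loaded, with the kind of type-specialized loop a template-based or
generated-code approach would produce:

```java
// Hypothetical sketch: why virtual calls in hot loops matter, and what
// specialization buys. Names are illustrative, not Arrow classes.
interface ValueReader {
    long get(int index);
}

final class IntReader implements ValueReader {
    private final int[] data;
    IntReader(int[] data) { this.data = data; }
    public long get(int index) { return data[index]; }
}

final class LongReader implements ValueReader {
    private final long[] data;
    LongReader(long[] data) { this.data = data; }
    public long get(int index) { return data[index]; }
}

public class DispatchSketch {
    // Generic version: if this call site sees many ValueReader
    // implementations, get() becomes megamorphic and the JIT may stop
    // inlining it, leaving a virtual dispatch inside the inner loop.
    static long sumGeneric(ValueReader reader, int length) {
        long sum = 0;
        for (int i = 0; i < length; i++) {
            sum += reader.get(i);
        }
        return sum;
    }

    // Specialized version: the kind of monomorphic loop that a per-type
    // "template" or generated code would emit; the array access inlines
    // trivially and is amenable to auto-vectorization.
    static long sumInts(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] ints = {1, 2, 3, 4};
        assert sumGeneric(new IntReader(ints), ints.length) == 10L;
        assert sumInts(ints) == 10L;
        System.out.println("ok");
    }
}
```

The template-based route amounts to keeping one `sumInts`-style
specialization per concrete vector type (written by hand or expanded at build
time), while bytecode generation emits the specialized loop at runtime once
the input types are known; both remove the virtual call from the inner loop.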

Thanks,
Micah

[1]
https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external

On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:

> Hi Micah,
>
> Thanks for your effort and precious time.
> Looking forward to receiving more valuable feedback from you.
>
> Best,
> Liya Fan
>
> On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Liya Fan,
>> I started reviewing but haven't gotten all the way through it. I will try
>> to leave more comments over the next few days.
>>
>> Thanks again for the write-up I think it will help frame a productive
>> conversation.
>>
>> -Micah
>>
>> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <li...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks for your kind reminder. Comments are enabled now.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> Hi Liya Fan,
>>>> Thank you for this writeup, it doesn't look like comments are enabled on
>>>> the document.  Could you allow for them?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com> wrote:
>>>>
>>>> > Dear all,
>>>> >
>>>> > We have prepared a document for discussing the requirements, design
>>>> and
>>>> > implementation issues for the algorithm module of Java:
>>>> >
>>>> >
>>>> >
>>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>>> >
>>>> > So far, we have finished the initial draft for sort, search and
>>>> dictionary
>>>> > encoding algorithms. Discussions for more algorithms may be added in
>>>> the
>>>> > future. This document will keep evolving to reflect the latest
>>>> discussion
>>>> > results in the community and the latest code changes.
>>>> >
>>>> > Please give your valuable feedback.
>>>> >
>>>> > Best,
>>>> > Liya Fan
>>>> >
>>>>
>>>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Fan Liya <li...@gmail.com>.
Dear all,

I have added the draft for the fourth part of the document.
This part contains discussion of more algorithms, some of which are already
in progress.

Please pay special attention to Section 4.2.1, as it contains a
general discussion about the representation of integer vectors.
Please take a look, and give your valuable feedback:

https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing

Thanks a lot for your attention.

Best,
Liya Fan


On Sat, Oct 5, 2019 at 2:59 PM fan_li_ya <fa...@aliyun.com> wrote:

> Hi Micah and Praveen,
>
> Thanks a lot for your valuable feedback.
>
> My thoughts on the problems:
>
> 1. About the audience of the algorithms:
>
> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines.  The major
> benefit of the micro-algorithms is their generality: they can be used in
> a wide range of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
>
> 2. About performance improvements:
>
> Code generation and template types are powerful tools. In addition, JIT is
> also a powerful tool, as it can inline megamorphic virtual functions for
> many scenarios, if the algorithm is implemented appropriately.
> IMO, code generation is applicable to almost all scenarios to achieve good
> performance, if we are willing to pay the price of code readability.
> I will try to detail the principles for choosing these tools for
> performance improvements later.
>
> Best,
> Liya Fan
>
> ------------------------------------------------------------------
> 发件人:Praveen Kumar <pr...@dremio.com>
> 发送时间:2019年10月4日(星期五) 19:20
> 收件人:Micah Kornfield <em...@gmail.com>
> 抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
> 主 题:Re: [DISCUSS][Java] Design of the algorithm module
>
> Hi Micah,
>
> I agree with 1. I think, as an end user, what they would really want is a
> query/data processing engine. I am not sure how easy/relevant the
> algorithms will be in the absence of the engine. E.g., most of these
> operators would need to be pipelined, handle memory, distribution, etc. So
> bundling this along with the engine makes a lot more sense; the interfaces
> required might be a bit different too for that.
>
> Thx.
>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Micah Kornfield <em...@gmail.com>.
>
> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are
> explicitly required.

This sounds reasonable; we can also evaluate on a case-by-case basis how
widely applicable some are.

> 2. For existing additional algorithms in our code base, we will stop
> improving them.

OK, I'm a little afraid of bit-rot here, but we can see how things go.

Cheers,
Micah


On Tue, Oct 22, 2019 at 7:09 PM Fan Liya <li...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for reading through my previous email.
>
> > Is the conversation about rejecting the changes in Flink something you
> can link to? I found [1] which seems to allow for Arrow, in what seem like
> reasonable places, just not inside the core planner (and even that is a
> possibility with a proper PoC).  However, I don't think the algorithms
> proposed here are directly related to those discussions.
>
> There is a short discussion [1] in the ML. Please note that our proposal
> is not officially "rejected"; it has simply been ignored silently (in fact,
> this makes no difference to us). We have had some conferences/discussions
> with the Flink committers and founders; it seems they like the ideas, but no
> progress has been made so far, because the change is too large and too
> risky. The other issue you have indicated [2] represents another (earlier)
> attempt to incorporate Arrow into Flink. However, that issue has no
> progress either.
>
> > I don't agree with this conclusion.  Apache Drill, where most of the
> Java code came from, has been around for a longer period of time.  Also, even
> without Arrow being around, columnar vs. row-based DB engines is a design
> decision that has nothing to do with existing open source projects.  Does
> Flink use another open source library for its row representation?
>
> I think you mean that row vs. columnar representation and open source
> project selection are two independent issues. I agree with you.
> Flink has its own implementation for row store, although I think they
> should have used Arrow directly (had it been available earlier), as columnar
> storage is the mainstream.
>
> > I think this circles back around to my original points:
> >  1.  Which users are we expecting to use the algorithms package that
> aren't directly related to data transport in Java (i.e. additional
> algorithms)?  In many cases the algorithms seem like they would be query
> engine specific.  I haven't seen much evidence that there are users of the
> Java code base that need all these algorithms.
> >  2.  Contributions to any project consume resources and people's time.
> If there is only going to be one user of the code, it might not belong in
> Arrow "proper" due to these hurdles.
>
> I agree with you that contributing code consumes lots of effort, and we
> should only provide general algorithms.
>
> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are
> explicitly required.
> 2. For existing additional algorithms in our code base, we will stop
> improving them.
>
> Thanks again for your effort in reviewing algorithms and all the good
> review comments.
>
> Best,
> Liya Fan
>
>
> [1] http://mail-archives.apache.org/mod_mbox/flink-dev/201907.mbox/browser
> [2] https://issues.apache.org/jira/browse/FLINK-10929
>
> On Sun, Oct 20, 2019 at 12:05 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Liya Fan,
>> Is the conversation about rejecting the changes in Flink something you
>> can link to? I found [1] which seems to allow for Arrow, in what seem like
>> reasonable places, just not inside the core planner (and even that is a
>> possibility with a proper PoC).  However, I don't think the algorithms
>> proposed here are directly related to those discussions.
>>
>>> I think the lesson learned is that we should provide some features
>>> proactively (at least the general features), and make them good enough.
>>> Apache Flink was started around 2015, and Arrow's Java project was started
>>> in 2016. If Arrow had been made available earlier, maybe Flink would have
>>> chosen it in the first place.
>>
>>
>> I don't agree with this conclusion.  Apache Drill, where most of the Java
>> code came from, has been around for a longer period of time.  Also, even
>> without Arrow being around, columnar vs. row-based DB engines is a design
>> decision that has nothing to do with existing open source projects.  Does
>> Flink use another open source library for its row representation?
>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most
>>> users will choose to implement one by themselves, rather than opening a
>>> JIRA in the community. It takes a long time to provide a PR, review the
>>> code, merge the code, and wait for the next release.
>>
>>
>> I think this circles back around to my original points:
>>   1.  Which users are we expecting to use the algorithms package that
>> aren't directly related to data transport in Java (i.e. additional
>> algorithms)?  In many cases the algorithms seem like they would be query
>> engine specific.  I haven't seen much evidence that there are users of the
>> Java code base that need all these algorithms.
>>   2.  Contributions to any project consume resources and people's time.
>> If there is only going to be one user of the code, it might not belong in
>> Arrow "proper" due to these hurdles.
>>
>> Thanks,
>> Micah
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10929
>>
>> On Mon, Oct 14, 2019 at 10:38 PM Fan Liya <li...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks a lot for your valuable comments.
>>> I mostly agree with your points.
>>>
>>> Please note that in addition to algorithms directly related to Arrow
>>> features, there are algorithms indirectly related to Arrow features, and
>>> they should be given a lower priority. An example is IPC -> dictionary
>>> encoding -> vector sort. I agree that these features should adhere to the
>>> requirements of Arrow specification features.
>>>
>>> I also want to discuss why developers/users are not using some of the
>>> algorithms (IMO):
>>>
>>> 1. To let them use the algorithms, they must be available and the users
>>> must be aware of them. Our first algorithm was published only 3 months ago,
>>> in 0.14, with very limited functionality. In 0.15, we have more
>>> functionality, but 0.15 has been out for no more than 10 days.
>>>
>>> 2. To let them use the algorithms, they must be good enough. In
>>> particular,
>>>   1) They must be performant. So far for us, performance improvement has
>>> started, and there is still much room for further improvement.
>>>   2) They must be functionally complete. We have not reached this goal
>>> yet. For example, we do not support sorting all vector types.
>>>   3) The algorithms should be easy to use.
>>>
>>> 3. Some SQL engines rely on native code implementations. For example,
>>> Dremio has Gandiva, which is based on LLVM. For such scenarios, we do not
>>> recommend that they use our algorithms.
>>>
>>> 4. Some SQL engines rely on Java implementations, but they do not rely
>>> on Arrow (e.g. Drill, Flink, etc.). I think this issue (convince them to
>>> use Arrow) should be placed with a higher priority. If these engines rely
>>> on Arrow, it would be more likely that they will use Arrow algorithms
>>> (provided that the algorithms are good enough).
>>>
>>> Concerning the last point, I want to share our experience when
>>> introducing Arrow to another project. (I am not sure if it is appropriate
>>> to discuss it here. Maybe you can give us some advice.)
>>>
>>> Apache Flink's runtime is based on Java but not on Arrow. We have a
>>> private Flink branch in our team, which is based on Arrow. Compared with
>>> the open source edition, our Arrow-based edition provides higher
>>> performance. It improves the performance by 30% for the TPCH 1TB benchmark
>>> (Please see Section 4 of [1]). We wanted to contribute the changes to the
>>> Flink community and convince them to use Arrow in their core engine.
>>> However, at least for now, they have not accepted the proposal.
>>>
>>> The main reason is that the changes are too big, which is too risky:
>>> many underlying data structures, algorithms and the framework must be
>>> changed. They admit that using Arrow is better, and they are aware of the
>>> performance improvements, but they are just unwilling to take the risk.
>>>
>>> I think the lesson learned is that we should provide some features
>>> proactively (at least the general features), and make them good enough.
>>> Apache Flink was started around 2015, and Arrow's Java project was started
>>> in 2016. If Arrow had been made available earlier, maybe Flink would have
>>> chosen it in the first place.
>>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most
>>> users will choose to implement one by themselves, rather than opening a
>>> JIRA in the community. It takes a long time to provide a PR, review the
>>> code, merge the code, and wait for the next release.
>>>
>>> Therefore, I think what we should do is try all means to make Arrow
>>> better: by providing general functionalities, by making them performant, by
>>> making them functionally complete, and by making them easier to use. By making
>>> Arrow better, I believe more users will choose Arrow. When trust is
>>> established, more users will switch to Arrow.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> [1]
>>> https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>>>
>>> [2] https://issues.apache.org/jira/browse/FLINK-13053
>>>
>>>
>>> On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> Hi Liya Fan,
>>>>
>>>>> I think the algorithms should be better termed "micro-algorithms".
>>>>> They are termed "micro" in the sense that they do not directly compose a
>>>>> query engine, because they only provide primitive functionalities (e.g.
>>>>> vector sort).
>>>>> Instead, they can be used as building blocks for query engines.  The
>>>>> major benefit of the micro-algorithms is their generality: they can be used
>>>>> in a wide range of common scenarios. They can be used in more than one query
>>>>> engine. In addition, there are other common scenarios, like vector data
>>>>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>>>>> have already supported/discussed), IPC communication, data analysis, data
>>>>> mining, etc.
>>>>
>>>>
>>>> I agree the algorithms can be generally useful.  But I still have
>>>> concerns about who is going to use them.
>>>>
>>>> I think there are two categories the algorithms fall into:
>>>> 1.  Algorithms directly related to Arrow specification features.  For
>>>> these, I agree some of the functionality will be needed as a reference
>>>> implementation.  At least for existing functionality, I think there is
>>>> already sufficient coverage, and in some cases (e.g. dictionary encoding)
>>>> duplicate coverage.
>>>>
>>>> 2.  Other algorithms -  I think these fall into "data analysis, data
>>>> mining, etc.", and for these I think it goes back to the question of
>>>> whether developers/users would use the given algorithms to build their own
>>>> one-off analyses or use already existing tools like Apache Spark or a
>>>> SQL engine that already incorporates the algorithms.
>>>>
>>>> I'm a little disappointed that more maintainers/developers haven't given
>>>> their input on this topic.  I hope some will help with the work involved in
>>>> reviewing them if they find them valuable.
>>>>
>>>> Thanks,
>>>> Micah
>>>>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Fan Liya <li...@gmail.com>.
Hi Micah,

Thank you for reading through my previous email.

> Is the conversation about rejecting the changes in Flink something you
can link to? I found [1] which seems to allow for Arrow, in what seem like
reasonable places, just not inside the core planner (and even that is a
possibility with a proper PoC).  However, I don't think the algorithms
proposed here are directly related to those discussions.

There is a short discussion [1] in the ML. Please note that our proposal is
not officially "rejected"; it has simply been ignored silently (in fact, this
makes no difference to us). We have had some conferences/discussions with the
Flink committers and founders; it seems they like the ideas, but no progress
has been made so far, because the change is too large and too risky. The other
issue you have indicated [2] represents another (earlier) attempt to
incorporate Arrow into Flink. However, that issue has no progress either.

> I don't agree with this conclusion.  Apache Drill, where most of the Java
code came from, has been around for a longer period of time.  Also, even
without Arrow being around, columnar vs. row-based DB engines is a design
decision that has nothing to do with existing open source projects.  Does
Flink use another open source library for its row representation?

I think you mean that row vs. columnar representation and open source
project selection are two independent issues. I agree with you.
Flink has its own implementation for row store, although I think they
should have used Arrow directly (had it been available earlier), as columnar
storage is the mainstream.
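To make the row-versus-columnar point concrete, here is a toy Java sketch
(illustrative types only, not Flink's or Arrow's actual classes) showing why
a columnar layout favors the analytical scans discussed in this thread:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of row vs. columnar data layouts; these are illustrative
// classes, not Flink's or Arrow's real data structures.
public class LayoutSketch {

    // Row-oriented: one object per record, all fields of a record together.
    static final class Row {
        final int id;
        final double price;
        Row(int id, double price) {
            this.id = id;
            this.price = price;
        }
    }

    // Scanning a single field in a row layout still walks every record
    // object, pulling unrelated fields through the cache.
    static double sumPricesRowWise(List<Row> rows) {
        double sum = 0;
        for (Row r : rows) {
            sum += r.price;
        }
        return sum;
    }

    // Column-oriented: one contiguous array per field; scanning a single
    // column is cache-friendly and easy for the JIT to vectorize.
    static double sumPricesColumnar(double[] priceColumn) {
        double sum = 0;
        for (double p : priceColumn) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row(1, 2.5), new Row(2, 7.5));
        double[] prices = {2.5, 7.5};
        assert sumPricesRowWise(rows) == 10.0;
        assert sumPricesColumnar(prices) == 10.0;
        System.out.println("ok");
    }
}
```

A columnar scan of one field touches a single contiguous array, which is why
column stores tend to win for analytical workloads; the row layout wins when
whole records are read or written together, as in many streaming operators.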

> I think this circles back around to my original points:
>  1.  Which users are we expecting to use the algorithms package that
aren't directly related to data transport in Java (i.e. additional
algorithms)?  In many cases the algorithms seem like they would be query
engine specific.  I haven't seen much evidence that there are users of the
Java code base that need all these algorithms.
>  2.  Contributions to any project consume resources and people's time. If
there is only going to be one user of the code, it might not belong in Arrow
"proper" due to these hurdles.

I agree with you that contributing code consumes lots of effort, and we
should only provide general algorithms.

To save the effort, or invest it in higher-priority issues, we plan to:
1. We will stop providing "additional algorithms", unless they are
explicitly required.
2. For existing additional algorithms in our code base, we will stop
improving them.

Thanks again for your effort in reviewing algorithms and all the good
review comments.

Best,
Liya Fan


[1] http://mail-archives.apache.org/mod_mbox/flink-dev/201907.mbox/browser
[2] https://issues.apache.org/jira/browse/FLINK-10929

On Sun, Oct 20, 2019 at 12:05 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Liya Fan,
> Is the conversation about rejecting the changes in Flink something you can
> link to? I found [1] which seems to allow for Arrow, in what seem like
> reasonable places, just not inside the core planner (and even that is a
> possibility with a proper PoC).  However, I don't think the algorithms
> proposed here are directly related to those discussions.
>
> I think the lesson learned is that, we should provide some features
>> proactively (at least the general features), and make them good enough.
>> Apache Flink was started around 2015, and Arrow's Java project was started
>> in 2016. If Arrow were made available earlier, maybe Flink would have
>> chosen it in the first place.
>
>
> I don't agree with this conclusion.  Apache Drill, where most of the Java
> code came from, has been around for a longer period of time.  Also, even
> without Arrow being around, columnar vs. row-based DB engines is a design
> decision that has nothing to do with existing open source projects.  Does
> Flink use another open source library for its row representation?
>
> When a user needs an algorithm, it may already be too late. AFAIK, most
>> users will choose to implement one by themselves, rather than opening a
>> JIRA in the community. It takes a long time to provide a PR, review the
>> code, merge the code, and wait for the next release.
>
>
> I think this circles back around to my original points:
>   1.  Which users are we expecting to use the algorithms package that
> aren't directly related to data transport in Java (i.e. additional
> algorithms)?  In many cases the algorithms seem like they would be query
> engine specific.  I haven't seen much evidence that there are users of the
> Java code base that need all these algorithms.
>   2.  Contributions to any project consume resources and peoples' time. If
> there is only going to be one user of the code it might not belong in Arrow
> "proper" due to these hurdles.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/FLINK-10929
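
Micah's point 2 above can be illustrated with a small sketch. The names
below are hypothetical and not part of the Arrow Java module; the contrast
is between a generic loop whose per-element comparison is a virtual call
through an interface (potentially megamorphic across many comparator
implementations) and a type-specialized copy of the same loop, which is
what both the template-based and code-generation approaches aim to produce:

```java
// Hypothetical illustration of the "template-based" specialization idea;
// these names do not correspond to the actual Arrow Java algorithm module.
public class SpecializationSketch {
    /** Generic form: every comparison goes through a virtual call. */
    interface ValueComparator {
        int compare(int leftIndex, int rightIndex);
    }

    /** Checks sortedness of [0, length) via an abstract comparator. */
    static boolean isSortedGeneric(int length, ValueComparator cmp) {
        for (int i = 1; i < length; i++) {
            if (cmp.compare(i - 1, i) > 0) {
                return false; // virtual dispatch on every element
            }
        }
        return true;
    }

    /** Specialized form for int data: a plain, inlinable comparison. */
    static boolean isSortedInts(int[] data) {
        for (int i = 1; i < data.length; i++) {
            if (data[i - 1] > data[i]) {
                return false; // no virtual call on the hot path
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 2};
        ValueComparator cmp = (l, r) -> Integer.compare(data[l], data[r]);
        System.out.println(isSortedGeneric(data.length, cmp));
        System.out.println(isSortedInts(new int[] {1, 2, 3}));
    }
}
```

When only one or two comparator classes ever reach the generic call site,
the JIT can often devirtualize and inline it; the specialized form removes
the dependence on that runtime profile.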
>
> On Mon, Oct 14, 2019 at 10:38 PM Fan Liya <li...@gmail.com> wrote:
>
>> Hi Micah,
>>
>> Thanks a lot for your valuable comments.
>> I mostly agree with your points.
>>
>> Please note that in addition to algorithms directly related to Arrow
>> features, there are algorithms indirectly related to Arrow features, and
>> they should be assigned a lower priority. An example is IPC -> dictionary
>> encoding -> vector sort. I agree that these features should adhere to
>> requirements of Arrow specification features.
>>
>> I also want to discuss why developers/users are not using some of the
>> algorithms (IMO):
>>
>> 1. To let them use the algorithms, they must be available and the users
>> must be aware of them. Our first algorithm was published only 3 months ago,
>> in 0.14, which has very limited functionalities. In 0.15, we have more
>> functionalities, but 0.15 has been published for no more than 10 days.
>>
>> 2. To let them use the algorithms, they must be good enough. In
>> particular,
>>   1) They must be performant. So far for us, performance improvement has
>> started, and there is still much room for further improvement.
>>   2) They must be functionally complete. We have not reached this goal
>> yet. For example, we do not support sorting all vector types.
>>   3) The algorithms should be easy to use.
>>
>> 3. Some SQL engines rely on native code implementations. For example,
>> Dremio has Gandiva, which is based on LLVM. For such scenarios, we do not
>> recommend them to use our algorithms.
>>
>> 4. Some SQL engines rely on Java implementations, but they do not rely on
>> Arrow (e.g. Drill, Flink, etc.). I think this issue (convince them to use
>> Arrow) should be placed with a higher priority. If these engines rely on
>> Arrow, it would be more likely that they will use Arrow algorithms
>> (provided that the algorithms are good enough).
>>
>> Concerning the last point, I want to share our experience when
>> introducing Arrow to another project. (I am not sure if it is appropriate
>> to discuss it here. Maybe you can give us some advice.)
>>
>> Apache Flink's runtime is based on Java but not on Arrow. We have a
>> private Flink branch in our team, which is based on Arrow. Compared with
>> the open source edition, our Arrow-based edition provides higher
>> performance. It improves the performance by 30% for the TPCH 1TB benchmark
>> (Please see Section 4 of [1]). We wanted to contribute the changes to the
>> Flink community and convince them to use Arrow in their core engine.
>> However, at least for now, they have not accepted the proposal.
>>
>> The main reason is that the changes are too big, which is too risky: many
>> underlying data structures, algorithms and the framework must be changed.
>> They admit that using Arrow is better, and they are aware of the
>> performance improvements, but they are just unwilling to take the risk.
>>
>> I think the lesson learned is that, we should provide some features
>> proactively (at least the general features), and make them good enough.
>> Apache Flink was started around 2015, and Arrow's Java project was started
>> in 2016. If Arrow were made available earlier, maybe Flink would have
>> chosen it in the first place.
>>
>> When a user needs an algorithm, it may already be too late. AFAIK, most
>> users will choose to implement one by themselves, rather than opening a
>> JIRA in the community. It takes a long time to provide a PR, review the
>> code, merge the code, and wait for the next release.
>>
>> Therefore, I think what we should do is to try all means to make Arrow
>> better, by providing general functionalities, by making them performant, by
>> making them functionally complete and making them easier to use. By making
>> Arrow better, I believe more users will choose Arrow. When trust is
>> established, more users will switch to Arrow.
>>
>> Best,
>> Liya Fan
>>
>> [1]
>> https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>>
>> [2] https://issues.apache.org/jira/browse/FLINK-13053
>>
>>
>> On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> Hi Liya Fan,
>>>
>>>> I think the algorithms should be better termed "micro-algorithms". They
>>>> are termed "micro" in the sense that they do not directly compose a query
>>>> engine, because they only provide primitive functionalities (e.g. vector
>>>> sort).
>>>> Instead, they can be used as building blocks for query engines.  The
>>>> major benefit of the micro-algorithms is their generality: they can be used
>>>> in wide ranges of common scenarios. They can be used in more than one query
>>>> engine. In addition, there are other common scenarios, like vector data
>>>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>>>> have already supported/discussed), IPC communication, data analysis, data
>>>> mining, etc.
>>>
>>>
>>> I agree the algorithms can be generally useful.  But I still have
>>> concerns about who is going to use them.
>>>
>>> I think there are two categories the algorithms fall into:
>>> 1.  Algorithms directly related to Arrow specification features.  For
>>> these, I agree some of the functionality will be needed as a reference
>>> implementation.  At least for existing functionality, I think there is
>>> already sufficient coverage, and in some cases (e.g. dictionary encoding)
>>> there is already duplicate coverage.
>>>
>>> 2.  Other algorithms - I think these fall into "data analysis, data
>>> mining, etc.", and for these I think it goes back to the question of
>>> whether developers/users would use the given algorithms to build their own
>>> one-off analyses or use already existing tools like Apache Spark or a
>>> SQL engine that already incorporates the algorithms.
>>>
>>> I'm a little disappointed that more maintainers/developers haven't given
>>> their input on this topic.  I hope some will help with the work involved
>>> in reviewing the algorithms if they find them valuable.
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>> On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fa...@aliyun.com> wrote:
>>>
>>>> Hi Micah and Praveen,
>>>>
>>>> Thanks a lot for your valuable feedback.
>>>>
>>>> My thoughts on the problems:
>>>>
>>>> 1. About the audience of the algorithms:
>>>>
>>>> I think the algorithms should be better termed "micro-algorithms". They
>>>> are termed "micro" in the sense that they do not directly compose a query
>>>> engine, because they only provide primitive functionalities (e.g. vector
>>>> sort).
>>>> Instead, they can be used as building blocks for query engines.  The
>>>> major benefit of the micro-algorithms is their generality: they can be used
>>>> in wide ranges of common scenarios. They can be used in more than one query
>>>> engine. In addition, there are other common scenarios, like vector data
>>>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>>>> have already supported/discussed), IPC communication, data analysis, data
>>>> mining, etc.
>>>>
>>>> 2. About performance improvements:
>>>>
>>>> Code generation and template types are powerful tools. In addition, the
>>>> JIT compiler is also a powerful tool, as it can inline megamorphic virtual
>>>> functions in many scenarios, if the algorithm is implemented appropriately.
>>>> IMO, code generation is applicable to almost all scenarios for achieving
>>>> good performance, if we are willing to pay the price in code readability.
>>>> I will try to detail the principles for choosing among these tools for
>>>> performance improvements later.
>>>>
>>>> Best,
>>>> Liya Fan
>>>>
>>>> ------------------------------------------------------------------
>>>> 发件人:Praveen Kumar <pr...@dremio.com>
>>>> 发送时间:2019年10月4日(星期五) 19:20
>>>> 收件人:Micah Kornfield <em...@gmail.com>
>>>> 抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
>>>> 主 题:Re: [DISCUSS][Java] Design of the algorithm module
>>>>
>>>> Hi Micah,
>>>>
>>>>
>>>> I agree with 1.  I think, as an end user, what they would really want is
>>>> a query/data processing engine. I am not sure how easy/relevant the
>>>> algorithms will be in the absence of the engine. E.g., most of these
>>>> operators would need to be pipelined, handle memory, distribution, etc. So
>>>> bundling this along with the engine makes a lot more sense; the interfaces
>>>> required might be a bit different too for that.
>>>>
>>>> Thx.
>>>>
>>>>
>>>>
>>>> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi Liya Fan,
>>>> > Thanks again for writing this up.  I think it provides a road-map for
>>>>
>>>> > intended features.  I commented on the document but I wanted to raise a few
>>>>
>>>> > high-level concerns here as well to get more feedback from the community.
>>>> >
>>>>
>>>> > 1.  It isn't clear to me who the users of this will be.  My perception
>>>>
>>>> > is that in the Java ecosystem there aren't use-cases for the algorithms
>>>>
>>>> > outside of specific compute engines.  I'm not super involved in open-source
>>>>
>>>> > Java these days so I would love to hear others opinions. For instance, I'm
>>>>
>>>> > not sure if Dremio would switch to using these algorithms instead of the
>>>>
>>>> > ones they've already open-sourced [1], and Apache Spark I believe is only
>>>> > using Arrow for interfacing with Python (they similarly have their own
>>>>
>>>> > compute pipeline).  I think you mentioned in the past that these are being
>>>>
>>>> > used internally on an engine that your company is working on, but if that
>>>>
>>>> > is the only consumer it makes me wonder if the algorithm development might
>>>> > be better served as part of that engine.
>>>> >
>>>> > 2.  If we do move forward with this, we also need a plan for how to
>>>>
>>>> > optimize the algorithms to avoid virtual calls.  There are two high-level
>>>>
>>>> > approaches: template-based and (byte)code generation based.  Both aren't
>>>>
>>>> > applicable in all situations, but it would be good to come to consensus on
>>>> > when (and when not to) use each.
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>> > [1]
>>>> >
>>>> >
>>>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>>>> >
>>>> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <liya.fan03@gmail.com
>>>> > wrote:
>>>> >
>>>> > > Hi Micah,
>>>> > >
>>>> > > Thanks for your effort and precious time.
>>>> > > Looking forward to receiving more valuable feedback from you.
>>>> > >
>>>> > > Best,
>>>> > > Liya Fan
>>>> > >
>>>> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <
>>>> emkornfield@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hi Liya Fan,
>>>>
>>>> > >> I started reviewing but haven't gotten all the way through it. I will
>>>> > try
>>>> > >> to leave more comments over the next few days.
>>>> > >>
>>>>
>>>> > >> Thanks again for the write-up I think it will help frame a productive
>>>> > >> conversation.
>>>> > >>
>>>> > >> -Micah
>>>> > >>
>>>> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fan03@gmail.com
>>>> > wrote:
>>>> > >>
>>>> > >>> Hi Micah,
>>>> > >>>
>>>> > >>> Thanks for your kind reminder. Comments are enabled now.
>>>> > >>>
>>>> > >>> Best,
>>>> > >>> Liya Fan
>>>> > >>>
>>>> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
>>>> > emkornfield@gmail.com>
>>>> > >>> wrote:
>>>> > >>>
>>>> > >>>> Hi Liya Fan,
>>>>
>>>> > >>>> Thank you for this writeup, it doesn't look like comments are enabled
>>>> > on
>>>> > >>>> the document.  Could you allow for them?
>>>> > >>>>
>>>> > >>>> Thanks,
>>>> > >>>> Micah
>>>> > >>>>
>>>> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
>>>> > wrote:
>>>> > >>>>
>>>> > >>>> > Dear all,
>>>> > >>>> >
>>>>
>>>> > >>>> > We have prepared a document for discussing the requirements, design
>>>> > >>>> and
>>>> > >>>> > implementation issues for the algorithm module of Java:
>>>> > >>>> >
>>>> > >>>> >
>>>> > >>>> >
>>>> > >>>>
>>>> >
>>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>>> > >>>> >
>>>> > >>>> > So far, we have finished the initial draft for sort, search and
>>>> > >>>> dictionary
>>>>
>>>> > >>>> > encoding algorithms. Discussions for more algorithms may be added in
>>>> > >>>> the
>>>> > >>>> > future. This document will keep evolving to reflect the latest
>>>> > >>>> discussion
>>>> > >>>> > results in the community and the latest code changes.
>>>> > >>>> >
>>>> > >>>> > Please give your valuable feedback.
>>>> > >>>> >
>>>> > >>>> > Best,
>>>> > >>>> > Liya Fan
>>>> > >>>> >
>>>> > >>>>
>>>> > >>>
>>>> >
>>>>
>>>>
>>>>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Micah Kornfield <em...@gmail.com>.
Hi Liya Fan,
Is the conversation about rejecting the changes in Flink something you can
link to? I found [1] which seems to allow for Arrow, in what seem like
reasonable places, just not inside the core planner (and even that is a
possibility with a proper PoC).  However, I don't think the algorithms
proposed here are directly related to those discussions.

I think the lesson learned is that, we should provide some features
> proactively (at least the general features), and make them good enough.
> Apache Flink was started around 2015, and Arrow's Java project was started
> in 2016. If Arrow were made available earlier, maybe Flink would have
> chosen it in the first place.


I don't agree with this conclusion.  Apache Drill, where most of the Java
code came from has been around for longer period of time.  Also, even
without Arrow being around, columnar vs row based DB engines, is design
decision that has nothing to do with existing open source projects.  Does
Flink use another open source library for its row representation?

When a users needs a algorithm, it may be already too late. AFAIK, most
> users will choose to implement one by themselves, rather than openning a
> JIRA in the community. It takes a long time to provide a PR, review the
> code, merge the code, and wait for the next release.


I think this circles back around to my original points:
  1.  Which users are we expecting to use the algorithms package that
aren't directly related to data transport in Java (i.e. additional
algorithms)?  In many cases the algorithms seem like they would be query
engine specific.  I haven't seen much evidence that there are users of the
Java code base that need all these algorithms.
  2.  Contributions to any project consume resources and peoples' time. If
there is only going to be one user of the code it might not belong in Arrow
"proper" due to these hurdles.

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/FLINK-10929

On Mon, Oct 14, 2019 at 10:38 PM Fan Liya <li...@gmail.com> wrote:

> Hi Micah,
>
> Thanks a lot for your valuable comments.
> I mostly agree with you points.
>
> Please note that in addition to algorithms directly related to Arrow
> features, there are algorithms indirectly related to Arrow features, and
> they should be attached a lower priority. An example is IPC -> dictionary
> encoding -> vector sort. I agree that these features should adhere to
> requirements of Arrow specification features.
>
> I also want to discuss why developers/users are not using some of the
> algorithms (IMO):
>
> 1. To let them use the algorithms, they must be available and the users
> must be aware of them. Our first algorithm was published only 3 months ago,
> in 0.14, which has very limited functionalities. In 0.15, we have more
> functionalities, but 0.15 has been published for no more than 10 days.
>
> 2. To let them use the algorithms, they must be good enough. In particular,
>   1) They must be performant. So far for us, performance improvement has
> started, and there is still much room for further improvement.
>   2) They must be functionally complete. We have not reached this goal
> yet. For example, we do not support sorting all vector types.
>   3) The algorithms should be easy to use.
>
> 3. Some SQL engines rely on native code implementations. For example,
> Dremio has gandiva which is based on LLVM. For such scenarios, we do not
> recommend them to use our algorithms.
>
> 4. Some SQL engines rely on Java implementations, but they do not rely on
> Arrow (e.g. Drill, Flink, etc.). I think this issue (convince them to use
> Arrow) should be placed with a higher priority. If these engines rely on
> Arrow, it would be more likely that they will use Arrow algorithms
> (provided that the algorithms are good enough).
>
> Concerning the last point, I want to share our experience when introducing
> Arrow to another project. (I am not sure if it is appropriate to discuss it
> here. Maybe you can give us some advice.)
>
> Apache Flink's runtime is based on Java but not on Arrow. We have a
> private Flink branch in our team, which is based on Arrow. Compared with
> the open source edition, our Arrow-based edition provides higher
> performance. It improves the performance by 30% for the TPCH 1TB benchmark
> (Please see Section 4 of [1]). We wanted to contribute the changes to the
> Flink community and convince them to use Arrow in their core engine.
> However, at least for now, they have not accepted the proposal.
>
> The main reason is that the changes are too big, which is too risky: many
> underlying data structures, algorithms and the framework must be changed.
> They admit that using Arrow is better, and they are aware of the
> performance improvements, but they are just unwilling to take the risk.
>
> I think the lesson learned is that, we should provide some features
> proactively (at least the general features), and make them good enough.
> Apache Flink was started around 2015, and Arrow's Java project was started
> in 2016. If Arrow were made available earlier, maybe Flink would have
> chosen it in the first place.
>
> When a users needs a algorithm, it may be already too late. AFAIK, most
> users will choose to implement one by themselves, rather than openning a
> JIRA in the community. It takes a long time to provide a PR, review the
> code, merge the code, and wait for the next release.
>
> Therefore, I think what we should do is to try all means to make Arrow
> better, by providing general functionalities, by making them performant, by
> making them functionally complete and making them easier to use. By making
> Arrow better, I believe more users will chose Arrow. When trust is
> established, more users will switch to Arrow.
>
> Best,
> Liya Fan
>
> [1]
> https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>
> [2] https://issues.apache.org/jira/browse/FLINK-13053
>
>
> On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Liya Fan,
>>
>>> I think the algorithms should be better termed "micro-algorithms". They
>>> are termed "micro" in the sense that they do not directly compose a query
>>> engine, because they only provide primitive functionalities (e.g. vector
>>> sort).
>>> Instead, they can be used as building blocks for query engines.  The
>>> major benefit of the micro-algorithms is their generality: they can be used
>>> in wide ranges of common scenarios. They can be used in more than one query
>>> engine. In addition, there are other common scenarios, like vector data
>>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>>> have already supported/discussed), IPC communication, data analysis, data
>>> mining, etc.
>>
>>
>> I agree the algorithm can be generally useful.  But I still have concerns
>> about who is going to use them.
>>
>> I think there are two categories the algorithms fall into:
>> 1.  Algorithms directly related to Arrow specification features.  For
>> these, I agree some of functionality will be needed as a reference
>> implementation.  At least for existing functionality I think there is
>> already sufficient coverage and in some cases (i.e. dictionary there is
>> already) duplicate coverage.
>>
>> 2.  Other algorithms -  I think these fall into "data analysis, data
>> mining, etc.", and for these I think it goes back to the question, of
>> whether developers/users would use the given algorithms to build there own
>> one-off analysis or use already existing tools like Apache Spark or
>> SQL-engine that already incorporates the algorithms.
>>
>> I'm little disappointed that more maintainers/developers haven't given
>> there input on this topic.  I hope some will help with the work involved in
>> reviewing them if they find them valuable.
>>
>> Thanks,
>> Micah
>>
>>
>> On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fa...@aliyun.com> wrote:
>>
>>> Hi Micah and Praveen,
>>>
>>> Thanks a lot for your valuable feedback.
>>>
>>> My thoughts on the problems:
>>>
>>> 1. About audiance of the algorithms:
>>>
>>> I think the algorithms should be better termed "micro-algorithms". They
>>> are termed "micro" in the sense that they do not directly compose a query
>>> engine, because they only provide primitive functionalities (e.g. vector
>>> sort).
>>> Instead, they can be used as building blocks for query engines.  The
>>> major benefit of the micro-algorithms is their generality: they can be used
>>> in wide ranges of common scenarios. They can be used in more than one query
>>> engine. In addition, there are other common scenarios, like vector data
>>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>>> have already supported/discussed), IPC communication, data analysis, data
>>> mining, etc.
>>>
>>> 2. About performance improvments:
>>>
>>> Code generation and template types are powerful tools. In addition, JIT
>>> is also a powerful tool, as it can inline megamorphic virtual functions for
>>> many scenarios, if the algorithm is implemented appropriately.
>>> IMO, code generation is applicable to almost all scenarios to achieve
>>> good performance, if we are willing to pay the price of code readability.
>>> I will try to detail the principles for choosing these tools for
>>> performance improvements later.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> ------------------------------------------------------------------
>>> 发件人:Praveen Kumar <pr...@dremio.com>
>>> 发送时间:2019年10月4日(星期五) 19:20
>>> 收件人:Micah Kornfield <em...@gmail.com>
>>> 抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
>>> 主 题:Re: [DISCUSS][Java] Design of the algorithm module
>>>
>>> Hi Micah,
>>>
>>> I agree with 1., i think as an end user, what they would really want is a
>>> query/data processing engine. I am not sure how easy/relevant the
>>> algorithms will be in the absence of the engine. For e.g. most of these
>>> operators would need to pipelined, handle memory, distribution etc. So
>>> bundling this along with engine makes a lot more sense, the interfaces
>>> required might be a bit different too for that.
>>>
>>> Thx.
>>>
>>>
>>>
>>> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>> > Hi Liya Fan,
>>> > Thanks again for writing this up.  I think it provides a road-map for
>>>
>>> > intended features.  I commented on the document but I wanted to raise a few
>>>
>>> > high-level concerns here as well to get more feedback from the community.
>>> >
>>>
>>> > 1.  It isn't clear to me who the users will of this will be.  My perception
>>> > is that in the Java ecosystem there aren't use-cases for the algorithms
>>>
>>> > outside of specific compute engines.  I'm not super involved in open-source
>>>
>>> > Java these days so I would love to hear others opinions. For instance, I'm
>>>
>>> > not sure if Dremio would switch to using these algorithms instead of the
>>>
>>> > ones they've already open-sourced  [1] and Apache Spark I believe is only
>>> > using Arrow for interfacing with Python (they similarly have there own
>>>
>>> > compute pipeline).  I think you mentioned in the past that these are being
>>>
>>> > used internally on an engine that your company is working on, but if that
>>>
>>> > is the only consumer it makes me wonder if the algorithm development might
>>> > be better served as part of that engine.
>>> >
>>> > 2.  If we do move forward with this, we also need a plan for how to
>>>
>>> > optimize the algorithms to avoid virtual calls.  There are two high-level
>>> > approaches template-based and (byte)code generation based.  Both aren't
>>>
>>> > applicable in all situations but it would be good to come consensus on when
>>> > (and when not to) use each.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1]
>>> >
>>> >
>>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>>> >
>>> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:
>>> >
>>> > > Hi Micah,
>>> > >
>>> > > Thanks for your effort and precious time.
>>> > > Looking forward to receiving more valuable feedback from you.
>>> > >
>>> > > Best,
>>> > > Liya Fan
>>> > >
>>> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Liya Fan,
>>>
>>> > >> I started reviewing but haven't gotten all the way through it. I will
>>> > try
>>> > >> to leave more comments over the next few days.
>>> > >>
>>>
>>> > >> Thanks again for the write-up I think it will help frame a productive
>>> > >> conversation.
>>> > >>
>>> > >> -Micah
>>> > >>
>>> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fan03@gmail.com
>>> > wrote:
>>> > >>
>>> > >>> Hi Micah,
>>> > >>>
>>> > >>> Thanks for your kind reminder. Comments are enabled now.
>>> > >>>
>>> > >>> Best,
>>> > >>> Liya Fan
>>> > >>>
>>> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
>>> > emkornfield@gmail.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>>> Hi Liya Fan,
>>>
>>> > >>>> Thank you for this writeup, it doesn't look like comments are enabled
>>> > on
>>> > >>>> the document.  Could you allow for them?
>>> > >>>>
>>> > >>>> Thanks,
>>> > >>>> Micah
>>> > >>>>
>>> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
>>> > wrote:
>>> > >>>>
>>> > >>>> > Dear all,
>>> > >>>> >
>>>
>>> > >>>> > We have prepared a document for discussing the requirements, design
>>> > >>>> and
>>> > >>>> > implementation issues for the algorithm module of Java:
>>> > >>>> >
>>> > >>>> >
>>> > >>>> >
>>> > >>>>
>>> >
>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>> > >>>> >
>>> > >>>> > So far, we have finished the initial draft for sort, search and
>>> > >>>> dictionary
>>>
>>> > >>>> > encoding algorithms. Discussions for more algorithms may be added in
>>> > >>>> the
>>> > >>>> > future. This document will keep evolving to reflect the latest
>>> > >>>> discussion
>>> > >>>> > results in the community and the latest code changes.
>>> > >>>> >
>>> > >>>> > Please give your valuable feedback.
>>> > >>>> >
>>> > >>>> > Best,
>>> > >>>> > Liya Fan
>>> > >>>> >
>>> > >>>>
>>> > >>>
>>> >
>>>
>>>
>>>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Fan Liya <li...@gmail.com>.
Hi Micah,

Thanks a lot for your valuable comments.
I mostly agree with you points.

Please note that in addition to algorithms directly related to Arrow
features, there are algorithms indirectly related to Arrow features, and
they should be attached a lower priority. An example is IPC -> dictionary
encoding -> vector sort. I agree that these features should adhere to
requirements of Arrow specification features.

I also want to discuss why developers/users are not using some of the
algorithms (IMO):

1. To let them use the algorithms, they must be available and the users
must be aware of them. Our first algorithm was published only 3 months ago,
in 0.14, which has very limited functionalities. In 0.15, we have more
functionalities, but 0.15 has been published for no more than 10 days.

2. To let them use the algorithms, they must be good enough. In particular,
  1) They must be performant. So far for us, performance improvement has
started, and there is still much room for further improvement.
  2) They must be functionally complete. We have not reached this goal yet.
For example, we do not support sorting all vector types.
  3) The algorithms should be easy to use.

3. Some SQL engines rely on native code implementations. For example,
Dremio has gandiva which is based on LLVM. For such scenarios, we do not
recommend them to use our algorithms.

4. Some SQL engines rely on Java implementations, but they do not rely on
Arrow (e.g. Drill, Flink, etc.). I think this issue (convince them to use
Arrow) should be placed with a higher priority. If these engines rely on
Arrow, it would be more likely that they will use Arrow algorithms
(provided that the algorithms are good enough).

Concerning the last point, I want to share our experience when introducing
Arrow to another project. (I am not sure if it is appropriate to discuss it
here. Maybe you can give us some advice.)

Apache Flink's runtime is based on Java but not on Arrow. We have a private
Flink branch in our team, which is based on Arrow. Compared with the open
source edition, our Arrow-based edition provides higher performance. It
improves the performance by 30% for the TPCH 1TB benchmark (Please see
Section 4 of [1]). We wanted to contribute the changes to the Flink
community and convince them to use Arrow in their core engine. However, at
least for now, they have not accepted the proposal.

The main reason is that the changes are too big, which is too risky: many
underlying data structures, algorithms and the framework must be changed.
They admit that using Arrow is better, and they are aware of the
performance improvements, but they are just unwilling to take the risk.

I think the lesson learned is that, we should provide some features
proactively (at least the general features), and make them good enough.
Apache Flink was started around 2015, and Arrow's Java project was started
in 2016. If Arrow were made available earlier, maybe Flink would have
chosen it in the first place.

When a users needs a algorithm, it may be already too late. AFAIK, most
users will choose to implement one by themselves, rather than openning a
JIRA in the community. It takes a long time to provide a PR, review the
code, merge the code, and wait for the next release.

Therefore, I think what we should do is try all means to make Arrow
better: by providing general functionality, by making it performant, by
making it functionally complete, and by making it easier to use. By making
Arrow better, I believe more users will choose Arrow. When trust is
established, more users will switch to Arrow.

Best,
Liya Fan

[1]
https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb

[2] https://issues.apache.org/jira/browse/FLINK-13053


On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Liya Fan,
>
>> I think the algorithms might be better termed "micro-algorithms". They
>> are termed "micro" in the sense that they do not directly compose a query
>> engine; they only provide primitive functionality (e.g. vector sort).
>> Instead, they can be used as building blocks for query engines.  The
>> major benefit of the micro-algorithms is their generality: they can be
>> used in a wide range of common scenarios and in more than one query
>> engine. In addition, there are other common scenarios, like vector data
>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>> have already supported/discussed), IPC communication, data analysis, data
>> mining, etc.
>
>
> I agree the algorithms can be generally useful.  But I still have concerns
> about who is going to use them.
>
> I think there are two categories the algorithms fall into:
> 1.  Algorithms directly related to Arrow specification features.  For
> these, I agree some of the functionality will be needed as a reference
> implementation.  At least for existing functionality I think there is
> already sufficient coverage, and in some cases (e.g. dictionary encoding)
> there is already duplicate coverage.
>
> 2.  Other algorithms - I think these fall into "data analysis, data
> mining, etc.", and for these I think it goes back to the question of
> whether developers/users would use the given algorithms to build their own
> one-off analyses or use existing tools like Apache Spark or a
> SQL engine that already incorporates the algorithms.
>
> I'm a little disappointed that more maintainers/developers haven't given
> their input on this topic.  I hope some will help with the work involved in
> reviewing them if they find them valuable.
>
> Thanks,
> Micah
>
>
> On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fa...@aliyun.com> wrote:
>
>> Hi Micah and Praveen,
>>
>> Thanks a lot for your valuable feedback.
>>
>> My thoughts on the problems:
>>
>> 1. About audience of the algorithms:
>>
>> I think the algorithms might be better termed "micro-algorithms". They
>> are termed "micro" in the sense that they do not directly compose a query
>> engine; they only provide primitive functionality (e.g. vector sort).
>> Instead, they can be used as building blocks for query engines.  The
>> major benefit of the micro-algorithms is their generality: they can be
>> used in a wide range of common scenarios and in more than one query
>> engine. In addition, there are other common scenarios, like vector data
>> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
>> have already supported/discussed), IPC communication, data analysis, data
>> mining, etc.
>>
>> 2. About performance improvements:
>>
>> Code generation and template types are powerful tools. In addition, JIT
>> is also a powerful tool, as it can inline megamorphic virtual functions for
>> many scenarios, if the algorithm is implemented appropriately.
>> IMO, code generation is applicable to almost all scenarios to achieve
>> good performance, if we are willing to pay the price of code readability.
>> I will try to detail the principles for choosing these tools for
>> performance improvements later.
>>
>> Best,
>> Liya Fan
>>
>> ------------------------------------------------------------------
>> 发件人:Praveen Kumar <pr...@dremio.com>
>> 发送时间:2019年10月4日(星期五) 19:20
>> 收件人:Micah Kornfield <em...@gmail.com>
>> 抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
>> 主 题:Re: [DISCUSS][Java] Design of the algorithm module
>>
>> Hi Micah,
>>
>> I agree with 1. I think that, as an end user, what they would really want is a
>> query/data processing engine. I am not sure how easy/relevant the
>> algorithms will be in the absence of the engine. For example, most of these
>> operators would need to be pipelined, handle memory, distribution, etc. So
>> bundling this along with the engine makes a lot more sense; the interfaces
>> required might be a bit different too for that.
>>
>> Thx.
>>
>>
>>
>> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>> > Hi Liya Fan,
>> > Thanks again for writing this up.  I think it provides a road-map for
>>
>> > intended features.  I commented on the document but I wanted to raise a few
>>
>> > high-level concerns here as well to get more feedback from the community.
>> >
>>
>> > 1.  It isn't clear to me who the users of this will be.  My perception
>> > is that in the Java ecosystem there aren't use-cases for the algorithms
>>
>> > outside of specific compute engines.  I'm not super involved in open-source
>>
>> > Java these days so I would love to hear others' opinions. For instance, I'm
>> > not sure if Dremio would switch to using these algorithms instead of the
>>
>> > ones they've already open-sourced  [1] and Apache Spark I believe is only
>> > using Arrow for interfacing with Python (they similarly have their own
>>
>> > compute pipeline).  I think you mentioned in the past that these are being
>>
>> > used internally on an engine that your company is working on, but if that
>>
>> > is the only consumer it makes me wonder if the algorithm development might
>> > be better served as part of that engine.
>> >
>> > 2.  If we do move forward with this, we also need a plan for how to
>>
>> > optimize the algorithms to avoid virtual calls.  There are two high-level
>> > approaches template-based and (byte)code generation based.  Both aren't
>>
>> > applicable in all situations but it would be good to come to consensus on when
>> > (and when not to) use each.
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1]
>> >
>> >
>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>> >
>> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:
>> >
>> > > Hi Micah,
>> > >
>> > > Thanks for your effort and precious time.
>> > > Looking forward to receiving more valuable feedback from you.
>> > >
>> > > Best,
>> > > Liya Fan
>> > >
>> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Liya Fan,
>> > >> I started reviewing but haven't gotten all the way through it. I will
>> > try
>> > >> to leave more comments over the next few days.
>> > >>
>> > >> Thanks again for the write-up I think it will help frame a productive
>> > >> conversation.
>> > >>
>> > >> -Micah
>> > >>
>> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fan03@gmail.com
>> > wrote:
>> > >>
>> > >>> Hi Micah,
>> > >>>
>> > >>> Thanks for your kind reminder. Comments are enabled now.
>> > >>>
>> > >>> Best,
>> > >>> Liya Fan
>> > >>>
>> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
>> > emkornfield@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>>> Hi Liya Fan,
>>
>> > >>>> Thank you for this writeup, it doesn't look like comments are enabled
>> > on
>> > >>>> the document.  Could you allow for them?
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Micah
>> > >>>>
>> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
>> > wrote:
>> > >>>>
>> > >>>> > Dear all,
>> > >>>> >
>>
>> > >>>> > We have prepared a document for discussing the requirements, design
>> > >>>> and
>> > >>>> > implementation issues for the algorithm module of Java:
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>>
>> >
>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>> > >>>> >
>> > >>>> > So far, we have finished the initial draft for sort, search and
>> > >>>> dictionary
>>
>> > >>>> > encoding algorithms. Discussions for more algorithms may be added in
>> > >>>> the
>> > >>>> > future. This document will keep evolving to reflect the latest
>> > >>>> discussion
>> > >>>> > results in the community and the latest code changes.
>> > >>>> >
>> > >>>> > Please give your valuable feedback.
>> > >>>> >
>> > >>>> > Best,
>> > >>>> > Liya Fan
>> > >>>> >
>> > >>>>
>> > >>>
>> >
>>
>>
>>

Re: [DISCUSS][Java] Design of the algorithm module

Posted by Micah Kornfield <em...@gmail.com>.
Hi Liya Fan,

> I think the algorithms might be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine; they only provide primitive functionality (e.g. vector sort).
> Instead, they can be used as building blocks for query engines.  The
> major benefit of the micro-algorithms is their generality: they can be
> used in a wide range of common scenarios and in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.


I agree the algorithms can be generally useful.  But I still have concerns
about who is going to use them.

I think there are two categories the algorithms fall into:
1.  Algorithms directly related to Arrow specification features.  For
these, I agree some of the functionality will be needed as a reference
implementation.  At least for existing functionality I think there is
already sufficient coverage, and in some cases (e.g. dictionary encoding)
there is already duplicate coverage.

2.  Other algorithms - I think these fall into "data analysis, data
mining, etc.", and for these I think it goes back to the question of
whether developers/users would use the given algorithms to build their own
one-off analyses or use existing tools like Apache Spark or a
SQL engine that already incorporates the algorithms.

I'm a little disappointed that more maintainers/developers haven't given
their input on this topic.  I hope some will help with the work involved in
reviewing them if they find them valuable.

Thanks,
Micah


On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fa...@aliyun.com> wrote:

> Hi Micah and Praveen,
>
> Thanks a lot for your valuable feedback.
>
> My thoughts on the problems:
>
> 1. About audience of the algorithms:
>
> I think the algorithms might be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine; they only provide primitive functionality (e.g. vector sort).
> Instead, they can be used as building blocks for query engines.  The
> major benefit of the micro-algorithms is their generality: they can be
> used in a wide range of common scenarios and in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
>
> 2. About performance improvements:
>
> Code generation and template types are powerful tools. In addition, JIT is
> also a powerful tool, as it can inline megamorphic virtual functions for
> many scenarios, if the algorithm is implemented appropriately.
> IMO, code generation is applicable to almost all scenarios to achieve good
> performance, if we are willing to pay the price of code readability.
> I will try to detail the principles for choosing these tools for
> performance improvements later.
>
> Best,
> Liya Fan
>
> ------------------------------------------------------------------
> 发件人:Praveen Kumar <pr...@dremio.com>
> 发送时间:2019年10月4日(星期五) 19:20
> 收件人:Micah Kornfield <em...@gmail.com>
> 抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
> 主 题:Re: [DISCUSS][Java] Design of the algorithm module
>
> Hi Micah,
>
> I agree with 1. I think that, as an end user, what they would really want is a
> query/data processing engine. I am not sure how easy/relevant the
> algorithms will be in the absence of the engine. For example, most of these
> operators would need to be pipelined, handle memory, distribution, etc. So
> bundling this along with the engine makes a lot more sense; the interfaces
> required might be a bit different too for that.
>
> Thx.
>
>
>
> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hi Liya Fan,
> > Thanks again for writing this up.  I think it provides a road-map for
>
> > intended features.  I commented on the document but I wanted to raise a few
> > high-level concerns here as well to get more feedback from the community.
> >
>
> > 1.  It isn't clear to me who the users of this will be.  My perception
> > is that in the Java ecosystem there aren't use-cases for the algorithms
>
> > outside of specific compute engines.  I'm not super involved in open-source
>
> > Java these days so I would love to hear others' opinions. For instance, I'm
> > not sure if Dremio would switch to using these algorithms instead of the
> > ones they've already open-sourced  [1] and Apache Spark I believe is only
> > using Arrow for interfacing with Python (they similarly have their own
>
> > compute pipeline).  I think you mentioned in the past that these are being
> > used internally on an engine that your company is working on, but if that
>
> > is the only consumer it makes me wonder if the algorithm development might
> > be better served as part of that engine.
> >
> > 2.  If we do move forward with this, we also need a plan for how to
> > optimize the algorithms to avoid virtual calls.  There are two high-level
> > approaches template-based and (byte)code generation based.  Both aren't
>
> > applicable in all situations but it would be good to come to consensus on when
> > (and when not to) use each.
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
> >
> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:
> >
> > > Hi Micah,
> > >
> > > Thanks for your effort and precious time.
> > > Looking forward to receiving more valuable feedback from you.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Liya Fan,
> > >> I started reviewing but haven't gotten all the way through it. I will
> > try
> > >> to leave more comments over the next few days.
> > >>
> > >> Thanks again for the write-up I think it will help frame a productive
> > >> conversation.
> > >>
> > >> -Micah
> > >>
> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fan03@gmail.com
> > wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> Thanks for your kind reminder. Comments are enabled now.
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>>
> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
> > emkornfield@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Liya Fan,
>
> > >>>> Thank you for this writeup, it doesn't look like comments are enabled
> > on
> > >>>> the document.  Could you allow for them?
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
> > wrote:
> > >>>>
> > >>>> > Dear all,
> > >>>> >
>
> > >>>> > We have prepared a document for discussing the requirements, design
> > >>>> and
> > >>>> > implementation issues for the algorithm module of Java:
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>>
> >
> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> > >>>> >
> > >>>> > So far, we have finished the initial draft for sort, search and
> > >>>> dictionary
>
> > >>>> > encoding algorithms. Discussions for more algorithms may be added in
> > >>>> the
> > >>>> > future. This document will keep evolving to reflect the latest
> > >>>> discussion
> > >>>> > results in the community and the latest code changes.
> > >>>> >
> > >>>> > Please give your valuable feedback.
> > >>>> >
> > >>>> > Best,
> > >>>> > Liya Fan
> > >>>> >
> > >>>>
> > >>>
> >
>
>
>

回复:[DISCUSS][Java] Design of the algorithm module

Posted by fan_li_ya <fa...@aliyun.com.INVALID>.
Hi Micah and Praveen,

Thanks a lot for your valuable feedback.

My thoughts on the problems:

1. About audience of the algorithms:

I think the algorithms might be better termed "micro-algorithms". They are termed "micro" in the sense that they do not directly compose a query engine; they only provide primitive functionality (e.g. vector sort).
Instead, they can be used as building blocks for query engines.  The major benefit of the micro-algorithms is their generality: they can be used in a wide range of common scenarios and in more than one query engine. In addition, there are other common scenarios, like vector data compression/decompression (e.g. dictionary encoding and RLE encoding, as we have already supported/discussed), IPC communication, data analysis, data mining, etc.
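As a concrete illustration of the "building block" idea, here is a minimal, self-contained Java sketch of a search micro-algorithm. Note that the interface and class names here are illustrative only, not the actual Arrow algorithm API: the point is that the algorithm is written against a small comparator abstraction, so any engine can plug in its own vector representation.

```java
// Hypothetical sketch of a "micro-algorithm": a binary search decoupled
// from the physical data layout through a small comparator abstraction.
public class MicroAlgorithmSketch {

    /** Compares the element at an index against a search key. */
    public interface IndexedComparator {
        int compare(int index, int key);
    }

    /** Classic binary search expressed only in terms of the comparator. */
    public static int binarySearch(int length, int key, IndexedComparator cmp) {
        int low = 0, high = length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;       // overflow-safe midpoint
            int c = cmp.compare(mid, key);
            if (c < 0) {
                low = mid + 1;
            } else if (c > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        // A sorted "vector"; in Arrow this would be backed by an Arrow vector.
        int[] vector = {2, 3, 5, 8, 13, 21};
        IndexedComparator cmp = (index, key) -> Integer.compare(vector[index], key);
        System.out.println(binarySearch(vector.length, 8, cmp));  // prints 3
        System.out.println(binarySearch(vector.length, 7, cmp));  // prints -1
    }
}
```

The same shape works for sort and dictionary-encoding primitives: the algorithm body stays generic, and only the comparator/accessor is specialized per vector type.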

2. About performance improvements:

Code generation and template types are powerful tools. In addition, JIT is also a powerful tool, as it can inline megamorphic virtual functions for many scenarios, if the algorithm is implemented appropriately.
IMO, code generation is applicable to almost all scenarios to achieve good performance, if we are willing to pay the price of code readability.
I will try to detail the principles for choosing these tools for performance improvements later.
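As a rough illustration of the dispatch trade-off being discussed (all names below are illustrative, not Arrow APIs): the generic loop pays an interface call per element, which the JIT can inline only while the call site stays monomorphic or bimorphic, whereas a type-specialized copy of the loop, the kind that templates or (byte)code generation would produce, avoids the virtual dispatch entirely.

```java
// Illustrative sketch of the virtual-call problem in hot loops.
public class DispatchSketch {

    /** Generic accessor: at a megamorphic call site, each call is a
     *  virtual dispatch the JIT cannot inline. */
    public interface ValueGetter {
        long get(int index);
    }

    /** Generic sum: flexible, but dispatches through the interface per element. */
    public static long sumGeneric(int length, ValueGetter getter) {
        long total = 0;
        for (int i = 0; i < length; i++) {
            total += getter.get(i);
        }
        return total;
    }

    /** Specialized sum: the kind of code a template or bytecode generator
     *  would emit for a concrete int-backed vector; no virtual calls. */
    public static long sumIntArray(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        System.out.println(sumGeneric(values.length, i -> values[i])); // prints 10
        System.out.println(sumIntArray(values));                       // prints 10
    }
}
```

Both versions compute the same result; the difference only shows up in how the JIT can optimize the call site, which is why the choice between templates, code generation, and "JIT-friendly" generic code depends on how many implementations reach each call site.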

Best,
Liya Fan


------------------------------------------------------------------
发件人:Praveen Kumar <pr...@dremio.com>
发送时间:2019年10月4日(星期五) 19:20
收件人:Micah Kornfield <em...@gmail.com>
抄 送:Fan Liya <li...@gmail.com>; dev <de...@arrow.apache.org>
主 题:Re: [DISCUSS][Java] Design of the algorithm module

Hi Micah,

I agree with 1. I think that, as an end user, what they would really want is a
query/data processing engine. I am not sure how easy/relevant the
algorithms will be in the absence of the engine. For example, most of these
operators would need to be pipelined, handle memory, distribution, etc. So
bundling this along with the engine makes a lot more sense; the interfaces
required might be a bit different too for that.

Thx.



On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Liya Fan,
> Thanks again for writing this up.  I think it provides a road-map for
> intended features.  I commented on the document but I wanted to raise a few
> high-level concerns here as well to get more feedback from the community.
>
> 1.  It isn't clear to me who the users of this will be.  My perception
> is that in the Java ecosystem there aren't use-cases for the algorithms
> outside of specific compute engines.  I'm not super involved in open-source
> Java these days so I would love to hear others' opinions. For instance, I'm
> not sure if Dremio would switch to using these algorithms instead of the
> ones they've already open-sourced  [1] and Apache Spark I believe is only
> using Arrow for interfacing with Python (they similarly have their own
> compute pipeline).  I think you mentioned in the past that these are being
> used internally on an engine that your company is working on, but if that
> is the only consumer it makes me wonder if the algorithm development might
> be better served as part of that engine.
>
> 2.  If we do move forward with this, we also need a plan for how to
> optimize the algorithms to avoid virtual calls.  There are two high-level
> approaches template-based and (byte)code generation based.  Both aren't
> applicable in all situations but it would be good to come to consensus on when
> (and when not to) use each.
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>
> On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > Thanks for your effort and precious time.
> > Looking forward to receiving more valuable feedback from you.
> >
> > Best,
> > Liya Fan
> >
> > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Liya Fan,
> >> I started reviewing but haven't gotten all the way through it. I will
> try
> >> to leave more comments over the next few days.
> >>
> >> Thanks again for the write-up I think it will help frame a productive
> >> conversation.
> >>
> >> -Micah
> >>
> >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <li...@gmail.com> wrote:
> >>
> >>> Hi Micah,
> >>>
> >>> Thanks for your kind reminder. Comments are enabled now.
> >>>
> >>> Best,
> >>> Liya Fan
> >>>
> >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
> emkornfield@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Liya Fan,
> >>>> Thank you for this writeup, it doesn't look like comments are enabled
> on
> >>>> the document.  Could you allow for them?
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
> wrote:
> >>>>
> >>>> > Dear all,
> >>>> >
> >>>> > We have prepared a document for discussing the requirements, design
> >>>> and
> >>>> > implementation issues for the algorithm module of Java:
> >>>> >
> >>>> >
> >>>> >
> >>>>
> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> >>>> >
> >>>> > So far, we have finished the initial draft for sort, search and
> >>>> dictionary
> >>>> > encoding algorithms. Discussions for more algorithms may be added in
> >>>> the
> >>>> > future. This document will keep evolving to reflect the latest
> >>>> discussion
> >>>> > results in the community and the latest code changes.
> >>>> >
> >>>> > Please give your valuable feedback.
> >>>> >
> >>>> > Best,
> >>>> > Liya Fan
> >>>> >
> >>>>
> >>>
>


Re: [DISCUSS][Java] Design of the algorithm module

Posted by Praveen Kumar <pr...@dremio.com>.
Hi Micah,

I agree with 1. I think that, as an end user, what they would really want is a
query/data processing engine. I am not sure how easy/relevant the
algorithms will be in the absence of the engine. For example, most of these
operators would need to be pipelined, handle memory, distribution, etc. So
bundling this along with the engine makes a lot more sense; the interfaces
required might be a bit different too for that.

Thx.



On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Liya Fan,
> Thanks again for writing this up.  I think it provides a road-map for
> intended features.  I commented on the document but I wanted to raise a few
> high-level concerns here as well to get more feedback from the community.
>
> 1.  It isn't clear to me who the users of this will be.  My perception
> is that in the Java ecosystem there aren't use-cases for the algorithms
> outside of specific compute engines.  I'm not super involved in open-source
> Java these days so I would love to hear others' opinions. For instance, I'm
> not sure if Dremio would switch to using these algorithms instead of the
> ones they've already open-sourced  [1] and Apache Spark I believe is only
> using Arrow for interfacing with Python (they similarly have their own
> compute pipeline).  I think you mentioned in the past that these are being
> used internally on an engine that your company is working on, but if that
> is the only consumer it makes me wonder if the algorithm development might
> be better served as part of that engine.
>
> 2.  If we do move forward with this, we also need a plan for how to
> optimize the algorithms to avoid virtual calls.  There are two high-level
> approaches template-based and (byte)code generation based.  Both aren't
> applicable in all situations but it would be good to come to consensus on when
> (and when not to) use each.
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>
> On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > Thanks for your effort and precious time.
> > Looking forward to receiving more valuable feedback from you.
> >
> > Best,
> > Liya Fan
> >
> > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Liya Fan,
> >> I started reviewing but haven't gotten all the way through it. I will
> try
> >> to leave more comments over the next few days.
> >>
> >> Thanks again for the write-up I think it will help frame a productive
> >> conversation.
> >>
> >> -Micah
> >>
> >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <li...@gmail.com> wrote:
> >>
> >>> Hi Micah,
> >>>
> >>> Thanks for your kind reminder. Comments are enabled now.
> >>>
> >>> Best,
> >>> Liya Fan
> >>>
> >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <
> emkornfield@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Liya Fan,
> >>>> Thank you for this writeup, it doesn't look like comments are enabled
> on
> >>>> the document.  Could you allow for them?
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <li...@gmail.com>
> wrote:
> >>>>
> >>>> > Dear all,
> >>>> >
> >>>> > We have prepared a document for discussing the requirements, design
> >>>> and
> >>>> > implementation issues for the algorithm module of Java:
> >>>> >
> >>>> >
> >>>> >
> >>>>
> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> >>>> >
> >>>> > So far, we have finished the initial draft for sort, search and
> >>>> dictionary
> >>>> > encoding algorithms. Discussions for more algorithms may be added in
> >>>> the
> >>>> > future. This document will keep evolving to reflect the latest
> >>>> discussion
> >>>> > results in the community and the latest code changes.
> >>>> >
> >>>> > Please give your valuable feedback.
> >>>> >
> >>>> > Best,
> >>>> > Liya Fan
> >>>> >
> >>>>
> >>>
>