Posted to dev@arrow.apache.org by Jacques Nadeau <ja...@apache.org> on 2019/07/21 21:41:30 UTC

[DISCUSS][JAVA] Designs & goals for readers/writers

I've seen a couple of recent pieces of work on generating new
readers/writers for Arrow (Avro and discussion of CSV). I'd like to propose
a couple of guidelines to help ensure a high quality bar:

   1. Design review first - Before someone starts implementing a particular
   reader/writer, let's ask for a basic design outline in jira, google docs,
   etc.
   2. High bar for implementation: Having more readers for the sake of more
   readers should not be the goal of the project. Instead, people should
   expect Arrow Java readers to be high quality and faster than other readers
   (even if the consumer has to do a final conversion to move from the Arrow
   representation to their current internal representation). As such, I
   propose the following bars as part of design work:
      1. Field selection support as part of reads - Make sure that each
      implementation supports field selection (which columns to materialize) as
      part of the interface.
      2. Configurable target batch size - Different systems will want to
      control the target size of batch data.
      3. Minimize use of heap memory - Most of the core existing Arrow Java
      libraries have been very focused on minimizing on-heap memory
      consumption. While some heap use may be unavoidable, we continue to try
      to keep the footprint as small as possible, and new readers/writers
      should target the same standard. For example, the current Avro reader PR
      relies heavily on the Java Avro project's reader implementation, which
      has very poor heap characteristics.
      4. Industry leading performance - People should expect that using
      Arrow is very fast, and releasing something under this banner means we
      should focus on achieving that kind of target. To pick on the Avro
      reader again: our previous analysis has shown that the Java Avro
      project's reader (not the Arrow-connected impl) is frequently an order
      of magnitude or more slower than some other open source Avro readers
      (such as Impala's implementation), especially when applying any
      predicates or projections.
      5. (Ideally) Predicate application as part of reads - In the vast
      majority of workloads we've seen, a user applies one or more predicates
      when reading data. Whatever performance you gain from a strong read
      implementation will in most cases be drowned out if you fail to apply
      predicates as part of reading (and thus have to materialize far more
      records than you'll ultimately need).
   3. Propose a generalized "reader" interface as opposed to making each
   reader have a different way to package/integrate.
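
To make the bars above concrete, here is a minimal sketch of what such a
generalized reader contract could look like. This is purely illustrative:
the interface and class names (BatchReader, InMemoryReader) are hypothetical,
not an existing Arrow API, and a toy single-column integer source stands in
for a real format so the example stays dependency-free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class ReaderSketch {

    // Hypothetical generalized reader contract covering the proposed bars.
    interface BatchReader {
        void selectFields(List<String> fieldNames);   // which columns to materialize
        void setTargetBatchSize(int rows);            // consumer-controlled batch size
        void setPredicate(IntPredicate predicate);    // applied while reading, not after
        List<Integer> nextBatch();                    // empty list when exhausted
    }

    // Toy in-memory implementation to show the control flow.
    static class InMemoryReader implements BatchReader {
        private final int[] source;
        private int pos = 0;
        private int targetBatchSize = 1024;
        private IntPredicate predicate = v -> true;

        InMemoryReader(int[] source) { this.source = source; }

        public void selectFields(List<String> fieldNames) { /* single column here */ }
        public void setTargetBatchSize(int rows) { this.targetBatchSize = rows; }
        public void setPredicate(IntPredicate predicate) { this.predicate = predicate; }

        public List<Integer> nextBatch() {
            List<Integer> batch = new ArrayList<>();
            // The predicate is applied as rows are scanned, so rejected rows
            // are never materialized into the output batch.
            while (pos < source.length && batch.size() < targetBatchSize) {
                int v = source[pos++];
                if (predicate.test(v)) batch.add(v);
            }
            return batch;
        }
    }

    public static void main(String[] args) {
        InMemoryReader reader = new InMemoryReader(new int[] {1, 2, 3, 4, 5, 6, 7, 8});
        reader.setTargetBatchSize(3);
        reader.setPredicate(v -> v % 2 == 0);   // only even values survive the read
        List<Integer> batch;
        while (!(batch = reader.nextBatch()).isEmpty()) {
            System.out.println(batch);   // [2, 4, 6] then [8]
        }
    }
}
```

A shared contract along these lines would let each format implementation
(Avro, CSV, etc.) plug in behind the same selection/batch-size/predicate
knobs rather than inventing its own packaging.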

What do other people think?

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Ji Liu <ni...@aliyun.com.INVALID>.
Thanks for your proposal.
Agreed that Arrow readers/writers should have high performance, like the Orc reader. As mentioned above, I think the current Avro adapter should be positioned as an adapter rather than a native reader. I'm not sure whether Arrow wants library-based adapters at all, but I have updated the current design in ARROW-5845 [1] for your information anyway.


Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-5845


------------------------------------------------------------------
From:Jacques Nadeau <ja...@apache.org>
Send Time:2019年7月22日(星期一) 09:16
To:dev <de...@arrow.apache.org>; Micah Kornfield <em...@gmail.com>
Subject:Re: [DISCUSS][JAVA] Designs & goals for readers/writers

As I read through your responses, I think it might be useful to talk about
adapters versus native Arrow readers/writers. Adapters are something that
adapt an existing API to produce and/or consume Arrow data. A native
reader/writer is something that understands the format directly and does not
have intermediate representations or APIs the data moves through beyond
those needed to complete the work.

If people want to write adapters for Arrow, I see that as useful but very
different than writing native implementations and we should try to create a
clear delineation between the two.

Further comments inline.


> Could you expand on what level of detail you would like to see a design
> document?
>

A couple of paragraphs seems sufficient: the goals of the implementation, the
existing functionality X we target, whether it is an adapter or a native
impl, the expected memory and processing characteristics, etc. I've never
been one for huge amounts of design, but I've seen a number of recent patches
appear where there is no upfront discussion. Making sure that multiple people
buy into a design is the best way to ensure long-term maintenance and use.


> I think this should be optional (the same argument below about predicates
> apply so I won't repeat them).
>

Per my comments above, maybe adapter versus native reader clarifies things.
For example, I've been working on a native Avro read implementation. It is
little more than chicken scratch at this point, but its goals, vision and
design are very different than the adapter that is being produced atm.


> Can you clarify the intent of this objective.  Is it mainly to tie in with
> the existing Java arrow memory book keeping?  Performance?  Something else?
>

Arrow is designed to be off-heap. If you have large, variable amounts of
on-heap memory in an application, it becomes very hard to make decisions
about off-heap versus on-heap memory, since those divisions are by and large
static in nature. That's fine for short-lived applications, but for
long-lived applications working with a large amount of data, you want to
keep most of your memory in one pool. In the context of Arrow, this is
naturally going to be off-heap memory.
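
The on-heap/off-heap split described here can be illustrated with plain NIO
direct buffers. Note this is only a dependency-free stand-in: Arrow Java
actually manages off-heap memory through its own allocator hierarchy
(org.apache.arrow.memory.RootAllocator), not raw ByteBuffers.

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // On-heap: lives inside the Java heap, bounded by -Xmx, and the GC
        // may move it around.
        ByteBuffer onHeap = ByteBuffer.allocate(1024);

        // Off-heap ("direct"): lives outside the heap, bounded separately
        // (-XX:MaxDirectMemorySize), with stable addresses and no heap GC
        // pressure -- the static division Jacques refers to.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1024);

        System.out.println("onHeap direct?  " + onHeap.isDirect());   // false
        System.out.println("offHeap direct? " + offHeap.isDirect());  // true
    }
}
```

Because the heap and direct-memory limits are configured independently at
JVM startup, a long-lived process that buffers data in both pools has to
split a fixed budget up front, which is exactly why keeping data in one
(off-heap) pool is attractive.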


> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation.  Starting off with a known good implementation of conversion to
> Arrow can allow us to both to profile hot-spots and provide a comparison of
> implementations to verify correctness.
>

I'm not clear what message we're sending as a community if we produce low
performance components. The whole point of Arrow is to increase performance,
not decrease it. I'm targeting good, not perfect. At the same time, from my
perspective, Arrow development should not be approached in the same way that
general Java app development is. If we hold a high standard, we'll have
fewer total integrations initially, but I think we'll solve more real world
problems.

> There is also the question of how widely adoptable we want Arrow libraries
> to be.
> It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one.  As far as I know Impala's is a
> C++ implementation that does JIT with LLVM.  We could try to use it as a
> basis for converting to Arrow but I think this might limit adoption in some
> circumstances.  Some organizations/people might be hesitant to adopt the
> technology due to:
> 1.  Use of JNI.
> 2.  Use LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.
>

This is somewhat the crux of the problem. It goes a little bit to who our
consuming audience is and what we're trying to deliver. I'll also say that
trying to build a high-quality implementation on top of a low-quality
implementation or library-based adapter is worse than starting from scratch.
I believe this is especially true in Java, where developers are trained to
trust HotSpot and assume things will be good enough. That is great in a web
app but not in the systems software where we (and I expect others) will
deploy Arrow.


> >    3. Propose a generalized "reader" interface as opposed to making each
> >    reader have a different way to package/integrate.
>
> This also seems like a good idea.  Is this something you were thinking of
> doing or just a proposal that someone in the community should take up
> before we get too many more implementations?
>

I don't have something in mind and didn't have a plan to build something; I
just want to make sure we start getting consistent early, as opposed to once
we have a bunch of readers/adapters.

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Wes McKinney <we...@gmail.com>.
Yes, I think text files are OK, but I want to make sure that committers are
reviewing patches for binary files, because there have been a number of
incidents in the past where I had to roll back patches to remove such
files.


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
I haven't checked locally, but that file at least for me renders as a text
file in GitHub (with an Apache header).  If we want all test data in the
testing package I can make sure to move it, but I thought text files might
be ok in the main repo?

Thanks,
Micah


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Wes McKinney <we...@gmail.com>.
I noticed that test data-related files are beginning to be checked in

https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc

I wanted to make sure this doesn't turn into a slippery slope where we
end up with several megabytes or more of test data files


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
Are there currently files that need to be moved?

Thanks,
Micah

On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:

> Sort of tangentially related, but while we are on the topic:
>
> Please, if you would, avoid checking binary test data files into the
> main repository. Use https://github.com/apache/arrow-testing if you
> truly need to check in binary data -- something to look out for in
> code reviews
>
> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Hi Jacques,
> > Thanks for the clarifications. I think the distinction is useful.
> >
> > If people want to write adapters for Arrow, I see that as useful but very
> > > different than writing native implementations and we should try to
> create a
> > > clear delineation between the two.
> >
> >
> > What do you think about creating a "contrib" directory and moving the
> JDBC
> > and AVRO adapters into it? We should also probably provide more
> description
> > in pom.xml to make it clear for downstream consumers.
> >
> > We should probably come up with a name other than adapters for
> > readers/writers ("converters"?) and use it in the directory structure for
> > the existing Orc implementation?
> >
> > Thanks,
> > Micah
> >
> >
> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > As I read through your responses, I think it might be useful to talk
> about
> > > adapters versus native Arrow readers/writers. Adapters are something
> that
> > > adapt an existing API to produce and/or consume Arrow data. A native
> > > reader/writer is something that understands the format directly and does
> > > not have intermediate representations or APIs the data moves through
> > > beyond those that need to be used to complete work.
> > >
> > > If people want to write adapters for Arrow, I see that as useful but
> very
> > > different than writing native implementations and we should try to
> create a
> > > clear delineation between the two.
> > >
> > > Further comments inline.
> > >
> > >
> > >> Could you expand on what level of detail you would like to see a
> design
> > >> document?
> > >>
> > >
> > > A couple of paragraphs seems sufficient. These are the goals of the
> > > implementation. We target existing functionality X. It is an adapter. Or
> > > it is a native impl. This is the expected memory and processing
> > > characteristics, etc.  I've never been one for a huge amount of design,
> > > but I've seen a number of recent patches appear where there is no upfront
> > > discussion. Making sure that multiple people buy into a design is the
> > > best way to ensure long-term maintenance and use.
> > >
> > >
> > >> I think this should be optional (the same argument below about
> > >> predicates applies, so I won't repeat it).
> > >>
> > >
> > > Per my comments above, maybe adapter versus native reader clarifies
> > > things. For example, I've been working on a native avro read
> > > implementation. It is little more than chicken scratch at this point
> but
> > > its goals, vision and design are very different than the adapter that
> is
> > > being produced atm.
> > >
> > >
> > >> Can you clarify the intent of this objective.  Is it mainly to tie in
> with
> > >> the existing Java Arrow memory bookkeeping?  Performance?  Something
> > >> else?
> > >>
> > >
> > > Arrow is designed to be off-heap. If you have large variable amounts of
> > > on-heap memory in an application, it starts to make it very hard to
> make
> > > decisions about off-heap versus on-heap memory since those divisions
> are by
> > > and large static in nature. It's fine for short lived applications but
> for
> > > long lived applications, if you're working with a large amount of
> data, you
> > > want to keep most of your memory in one pool. In the context of Arrow,
> this
> > > is going to naturally be off-heap memory.
> > >
> > >
> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
> > >> situation.  Starting off with a known good implementation of
> conversion to
> > >> Arrow can allow us both to profile hot-spots and provide a
> comparison
> > >> of
> > >> implementations to verify correctness.
> > >>
> > >
> > > I'm not clear what message we're sending as a community if we produce
> low
> > > performance components. The whole point of Arrow is to increase performance,
> not
> > > decrease it. I'm targeting good, not perfect. At the same time, from my
> > > perspective, Arrow development should not be approached in the same way
> > > that general Java app development should be. If we hold a high
> standard,
> > > we'll have fewer total integrations initially but I think we'll solve
> more
> > > real world problems.
> > >
> > > There is also the question of how widely adoptable we want Arrow
> libraries
> > >> to be.
> > >> It isn't surprising to me that Impala's Avro reader is an order of
> > >> magnitude faster than the stock Java one.  As far as I know Impala's
> is a
> > >> C++ implementation that does JIT with LLVM.  We could try to use it
> as a
> > >> basis for converting to Arrow but I think this might limit adoption in
> > >> some
> > >> circumstances.  Some organizations/people might be hesitant to adopt
> the
> > >> technology due to:
> > >> 1.  Use of JNI.
> > >> 2.  Use of LLVM to do JIT.
> > >>
> > >> It seems that as long as we have a reasonably general interface to
> > >> data-sources we should be able to optimize/refactor aggressively when
> > >> needed.
> > >>
> > >
> > > This is somewhat the crux of the problem. It goes a little bit to who
> our
> > > consuming audience is and what we're trying to deliver. I'll also say
> that
> > > trying to build a high-quality implementation on top of low-quality
> > > implementation or library-based adapter is worse than starting from
> > > scratch. I believe this is especially true in Java where developers are
> > > trained to trust hotspot and that things will be good enough. That is
> great
> > > in a web app but not in systems software where we (and I expect others)
> > > will deploy Arrow.
> > >
> > >
> > >> >    3. Propose a generalized "reader" interface as opposed to making
> each
> > >> >    reader have a different way to package/integrate.
> > >>
> > >> This also seems like a good idea.  Is this something you were
> thinking of
> > >> doing or just a proposal that someone in the community should take up
> > >> before we get too many more implementations?
> > >>
> > >
> > > I don't have something in mind and didn't have a plan to build
> something,
> > > just want to make sure we start getting consistent early as opposed to
> once
> > > we have a bunch of readers/adapters.
> > >
>

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Wes McKinney <we...@gmail.com>.
Sort of tangentially related, but while we are on the topic:

Please, if you would, avoid checking binary test data files into the
main repository. Use https://github.com/apache/arrow-testing if you
truly need to check in binary data -- something to look out for in
code reviews

On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Jacques,
> Thanks for the clarifications. I think the distinction is useful.
>
> If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
>
>
> What do you think about creating a "contrib" directory and moving the JDBC
> and AVRO adapters into it? We should also probably provide more description
> in pom.xml to make it clear for downstream consumers.
>
> We should probably come up with a name other than adapters for
> readers/writers ("converters"?) and use it in the directory structure for
> the existing Orc implementation?
>
> Thanks,
> Micah
>
>
> On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > As I read through your responses, I think it might be useful to talk about
> > adapters versus native Arrow readers/writers. Adapters are something that
> > adapt an existing API to produce and/or consume Arrow data. A native
> > reader/writer is something that understands the format directly and does not
> > have intermediate representations or APIs the data moves through beyond
> > those that need to be used to complete work.
> >
> > If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
> >
> > Further comments inline.
> >
> >
> >> Could you expand on what level of detail you would like to see a design
> >> document?
> >>
> >
> > A couple of paragraphs seems sufficient. These are the goals of the
> > implementation. We target existing functionality X. It is an adapter. Or it
> > is a native impl. This is the expected memory and processing
> > characteristics, etc.  I've never been one for a huge amount of design but
> > I've seen a number of recent patches appear where there is no upfront
> > discussion. Making sure that multiple people buy into a design is the best way to
> > ensure long-term maintenance and use.
> >
> >
> >> I think this should be optional (the same argument below about predicates
> >> applies, so I won't repeat it).
> >>
> >
> > Per my comments above, maybe adapter versus native reader clarifies
> > things. For example, I've been working on a native avro read
> > implementation. It is little more than chicken scratch at this point but
> > its goals, vision and design are very different than the adapter that is
> > being produced atm.
> >
> >
> >> Can you clarify the intent of this objective.  Is it mainly to tie in with
> >> the existing Java Arrow memory bookkeeping?  Performance?  Something
> >> else?
> >>
> >
> > Arrow is designed to be off-heap. If you have large variable amounts of
> > on-heap memory in an application, it starts to make it very hard to make
> > decisions about off-heap versus on-heap memory since those divisions are by
> > and large static in nature. It's fine for short lived applications but for
> > long lived applications, if you're working with a large amount of data, you
> > want to keep most of your memory in one pool. In the context of Arrow, this
> > is going to naturally be off-heap memory.
> >
> >
> >> I'm afraid this might lead to a "perfect is the enemy of the good"
> >> situation.  Starting off with a known good implementation of conversion to
> >> Arrow can allow us both to profile hot-spots and provide a comparison
> >> of
> >> implementations to verify correctness.
> >>
> >
> > I'm not clear what message we're sending as a community if we produce low
> > performance components. The whole point of Arrow is to increase performance, not
> > decrease it. I'm targeting good, not perfect. At the same time, from my
> > perspective, Arrow development should not be approached in the same way
> > that general Java app development should be. If we hold a high standard,
> > we'll have fewer total integrations initially but I think we'll solve more
> > real world problems.
> >
> > There is also the question of how widely adoptable we want Arrow libraries
> >> to be.
> >> It isn't surprising to me that Impala's Avro reader is an order of
> >> magnitude faster than the stock Java one.  As far as I know Impala's is a
> >> C++ implementation that does JIT with LLVM.  We could try to use it as a
> >> basis for converting to Arrow but I think this might limit adoption in
> >> some
> >> circumstances.  Some organizations/people might be hesitant to adopt the
> >> technology due to:
> >> 1.  Use of JNI.
> >> 2.  Use of LLVM to do JIT.
> >>
> >> It seems that as long as we have a reasonably general interface to
> >> data-sources we should be able to optimize/refactor aggressively when
> >> needed.
> >>
> >
> > This is somewhat the crux of the problem. It goes a little bit to who our
> > consuming audience is and what we're trying to deliver. I'll also say that
> > trying to build a high-quality implementation on top of low-quality
> > implementation or library-based adapter is worse than starting from
> > scratch. I believe this is especially true in Java where developers are
> > trained to trust hotspot and that things will be good enough. That is great
> > in a web app but not in systems software where we (and I expect others)
> > will deploy Arrow.
> >
> >
> >> >    3. Propose a generalized "reader" interface as opposed to making each
> >> >    reader have a different way to package/integrate.
> >>
> >> This also seems like a good idea.  Is this something you were thinking of
> >> doing or just a proposal that someone in the community should take up
> >> before we get too many more implementations?
> >>
> >
> > I don't have something in mind and didn't have a plan to build something,
> > just want to make sure we start getting consistent early as opposed to once
> > we have a bunch of readers/adapters.
> >

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacques,
Thanks for the clarifications. I think the distinction is useful.

If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.


What do you think about creating a "contrib" directory and moving the JDBC
and AVRO adapters into it? We should also probably provide more description
in pom.xml to make it clear for downstream consumers.

We should probably come up with a name other than adapters for
readers/writers ("converters"?) and use it in the directory structure for
the existing Orc implementation?

Thanks,
Micah


On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <ja...@apache.org> wrote:

> As I read through your responses, I think it might be useful to talk about
> adapters versus native Arrow readers/writers. Adapters are something that
> adapt an existing API to produce and/or consume Arrow data. A native
> reader/writer is something that understands the format directly and does not
> have intermediate representations or APIs the data moves through beyond
> those that need to be used to complete work.
>
> If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.
>
> Further comments inline.
>
>
>> Could you expand on what level of detail you would like to see a design
>> document?
>>
>
> A couple of paragraphs seems sufficient. These are the goals of the
> implementation. We target existing functionality X. It is an adapter. Or it
> is a native impl. This is the expected memory and processing
> characteristics, etc.  I've never been one for a huge amount of design but
> I've seen a number of recent patches appear where there is no upfront
> discussion. Making sure that multiple people buy into a design is the best way to
> ensure long-term maintenance and use.
>
>
>> I think this should be optional (the same argument below about predicates
>> applies, so I won't repeat it).
>>
>
> Per my comments above, maybe adapter versus native reader clarifies
> things. For example, I've been working on a native avro read
> implementation. It is little more than chicken scratch at this point but
> its goals, vision and design are very different than the adapter that is
> being produced atm.
>
>
>> Can you clarify the intent of this objective.  Is it mainly to tie in with
>> the existing Java Arrow memory bookkeeping?  Performance?  Something
>> else?
>>
>
> Arrow is designed to be off-heap. If you have large variable amounts of
> on-heap memory in an application, it starts to make it very hard to make
> decisions about off-heap versus on-heap memory since those divisions are by
> and large static in nature. It's fine for short lived applications but for
> long lived applications, if you're working with a large amount of data, you
> want to keep most of your memory in one pool. In the context of Arrow, this
> is going to naturally be off-heap memory.
>
>
>> I'm afraid this might lead to a "perfect is the enemy of the good"
>> situation.  Starting off with a known good implementation of conversion to
>> Arrow can allow us both to profile hot-spots and provide a comparison
>> of
>> implementations to verify correctness.
>>
>
> I'm not clear what message we're sending as a community if we produce low
> performance components. The whole point of Arrow is to increase performance, not
> decrease it. I'm targeting good, not perfect. At the same time, from my
> perspective, Arrow development should not be approached in the same way
> that general Java app development should be. If we hold a high standard,
> we'll have fewer total integrations initially but I think we'll solve more
> real world problems.
>
> There is also the question of how widely adoptable we want Arrow libraries
>> to be.
>> It isn't surprising to me that Impala's Avro reader is an order of
>> magnitude faster than the stock Java one.  As far as I know Impala's is a
>> C++ implementation that does JIT with LLVM.  We could try to use it as a
>> basis for converting to Arrow but I think this might limit adoption in
>> some
>> circumstances.  Some organizations/people might be hesitant to adopt the
>> technology due to:
>> 1.  Use of JNI.
>> 2.  Use of LLVM to do JIT.
>>
>> It seems that as long as we have a reasonably general interface to
>> data-sources we should be able to optimize/refactor aggressively when
>> needed.
>>
>
> This is somewhat the crux of the problem. It goes a little bit to who our
> consuming audience is and what we're trying to deliver. I'll also say that
> trying to build a high-quality implementation on top of low-quality
> implementation or library-based adapter is worse than starting from
> scratch. I believe this is especially true in Java where developers are
> trained to trust hotspot and that things will be good enough. That is great
> in a web app but not in systems software where we (and I expect others)
> will deploy Arrow.
>
>
>> >    3. Propose a generalized "reader" interface as opposed to making each
>> >    reader have a different way to package/integrate.
>>
>> This also seems like a good idea.  Is this something you were thinking of
>> doing or just a proposal that someone in the community should take up
>> before we get too many more implementations?
>>
>
> I don't have something in mind and didn't have a plan to build something,
> just want to make sure we start getting consistent early as opposed to once
> we have a bunch of readers/adapters.
>

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Jacques Nadeau <ja...@apache.org>.
As I read through your responses, I think it might be useful to talk about
adapters versus native Arrow readers/writers. Adapters are something that
adapt an existing API to produce and/or consume Arrow data. A native
reader/writer is something that understands the format directly and does not
have intermediate representations or APIs the data moves through beyond
those that need to be used to complete work.

If people want to write adapters for Arrow, I see that as useful but very
different than writing native implementations and we should try to create a
clear delineation between the two.

Further comments inline.


> Could you expand on what level of detail you would like to see a design
> document?
>

A couple of paragraphs seems sufficient. These are the goals of the
implementation. We target existing functionality X. It is an adapter. Or it
is a native impl. This is the expected memory and processing
characteristics, etc.  I've never been one for a huge amount of design but
I've seen a number of recent patches appear where there is no upfront
discussion. Making sure that multiple people buy into a design is the best way to
ensure long-term maintenance and use.


> I think this should be optional (the same argument below about predicates
> applies, so I won't repeat it).
>

Per my comments above, maybe adapter versus native reader clarifies things.
For example, I've been working on a native avro read implementation. It is
little more than chicken scratch at this point but its goals, vision and
design are very different than the adapter that is being produced atm.


> Can you clarify the intent of this objective.  Is it mainly to tie in with
> the existing Java Arrow memory bookkeeping?  Performance?  Something else?
>

Arrow is designed to be off-heap. If you have large variable amounts of
on-heap memory in an application, it starts to make it very hard to make
decisions about off-heap versus on-heap memory since those divisions are by
and large static in nature. It's fine for short lived applications but for
long lived applications, if you're working with a large amount of data, you
want to keep most of your memory in one pool. In the context of Arrow, this
is going to naturally be off-heap memory.
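[Editor's note: as a dependency-free sketch of the on-heap/off-heap split described above. Arrow Java manages off-heap memory through its BufferAllocator; this example only illustrates the distinction with standard JDK ByteBuffers, and all names here are illustrative.]

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    // On-heap: backed by a byte[] inside the JVM heap, counted against -Xmx
    // and scanned by the garbage collector.
    static ByteBuffer onHeap(int size) {
        return ByteBuffer.allocate(size);
    }

    // Off-heap: a direct buffer outside the heap, analogous to Arrow's
    // buffers, so large long-lived data does not inflate GC workload.
    static ByteBuffer offHeap(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    public static void main(String[] args) {
        if (!onHeap(1024).hasArray()) throw new AssertionError();  // heap-backed
        if (offHeap(1024).hasArray()) throw new AssertionError();  // no backing array
        if (!offHeap(1024).isDirect()) throw new AssertionError();
    }
}
```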


> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation.  Starting off with a known good implementation of conversion to
> Arrow can allow us both to profile hot-spots and provide a comparison of
> implementations to verify correctness.
>

I'm not clear what message we're sending as a community if we produce low
performance components. The whole point of Arrow is to increase performance, not
decrease it. I'm targeting good, not perfect. At the same time, from my
perspective, Arrow development should not be approached in the same way
that general Java app development should be. If we hold a high standard,
we'll have fewer total integrations initially but I think we'll solve more
real world problems.

There is also the question of how widely adoptable we want Arrow libraries
> to be.
> It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one.  As far as I know Impala's is a
> C++ implementation that does JIT with LLVM.  We could try to use it as a
> basis for converting to Arrow but I think this might limit adoption in some
> circumstances.  Some organizations/people might be hesitant to adopt the
> technology due to:
> 1.  Use of JNI.
> 2.  Use of LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.
>

This is somewhat the crux of the problem. It goes a little bit to who our
consuming audience is and what we're trying to deliver. I'll also say that
trying to build a high-quality implementation on top of low-quality
implementation or library-based adapter is worse than starting from
scratch. I believe this is especially true in Java where developers are
trained to trust hotspot and that things will be good enough. That is great
in a web app but not in systems software where we (and I expect others)
will deploy Arrow.


> >    3. Propose a generalized "reader" interface as opposed to making each
> >    reader have a different way to package/integrate.
>
> This also seems like a good idea.  Is this something you were thinking of
> doing or just a proposal that someone in the community should take up
> before we get too many more implementations?
>

I don't have something in mind and didn't have a plan to build something,
just want to make sure we start getting consistent early as opposed to once
we have a bunch of readers/adapters.

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacques,
I added more comments/questions inline, but as a TL;DR: generally these all
sound like good goals, but I have a concern that, as policy, they might lead to a
"boil the ocean" type approach that could potentially delay useful
functionality.

Thanks,
Micah

On Sun, Jul 21, 2019 at 2:41 PM Jacques Nadeau <ja...@apache.org> wrote:

> I've seen a couple of recent pieces of work on generating new
> readers/writers for Arrow (Avro and discussion of CSV). I'd like to propose
> a couple of guidelines to help ensure a high quality bar:
>
>    1. Design review first - Before someone starts implementing a particular
>    reader/writer, let's ask for a basic design outline in jira, google docs,
>    etc.
>
Could you expand on what level of detail you would like to see a design
document?

   2. High bar for implementation: Having more readers for the sake of more
>    readers should not be the goal of the project. Instead, people should
>    expect Arrow Java readers to be high quality and faster than other readers
>    (even if the consumer has to do a final conversion to move from the Arrow
>    representation to their current internal representation). As such, I
>    propose the following bars as part of design work:
>       1. Field selection support as part of reads - Make sure that each
>       implementation supports field selection (which columns to materialize)
>       as part of the interface.
>

I think this should be optional (the same argument below about predicates
applies, so I won't repeat it).
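[Editor's note: for concreteness, a toy sketch of what field selection baked into a reader's contract could look like. Every name here is hypothetical, not an existing or proposed Arrow API; the point is only that the projection travels with the reader, so unselected columns are never decoded.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ProjectionDemo {
    // Hypothetical reader whose interface carries the column projection.
    static final class ToyReader {
        private final List<String> schema;
        private final Set<String> selected;

        ToyReader(List<String> schema) {
            this(schema, new LinkedHashSet<>(schema)); // default: all columns
        }

        private ToyReader(List<String> schema, Set<String> selected) {
            this.schema = schema;
            this.selected = selected;
        }

        // Field selection as part of the interface: unknown names fail fast.
        ToyReader select(String... fields) {
            Set<String> s = new LinkedHashSet<>(Arrays.asList(fields));
            if (!schema.containsAll(s)) {
                throw new IllegalArgumentException("unknown field in " + s);
            }
            return new ToyReader(schema, s);
        }

        List<String> materializedColumns() {
            return new ArrayList<>(selected);
        }
    }

    public static void main(String[] args) {
        ToyReader r = new ToyReader(List.of("id", "name", "ts")).select("id", "ts");
        if (!r.materializedColumns().equals(List.of("id", "ts"))) {
            throw new AssertionError();
        }
    }
}
```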


>       2. Configurable target batch size - Different systems will want to
>       control the target size of batch data.
>

Agree this should be supported by all readers.  I view the Avro
implementation as a work in progress, but I did raise this on the PRs and
expect it should be done before we call the Avro work done.
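[Editor's note: a toy illustration of a caller-configured target batch size, with purely illustrative names. The reader bounds each output batch at the requested row count, so the consuming system controls its memory granularity.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSizeDemo {
    // Hypothetical: split a row stream into batches of at most
    // targetBatchSize rows, as a configurable reader would bound its output.
    static List<int[]> toBatches(int[] rows, int targetBatchSize) {
        if (targetBatchSize <= 0) {
            throw new IllegalArgumentException("batch size must be positive");
        }
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < rows.length; start += targetBatchSize) {
            int end = Math.min(start + targetBatchSize, rows.length);
            batches.add(Arrays.copyOfRange(rows, start, end));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[]> batches = toBatches(new int[]{1, 2, 3, 4, 5}, 2);
        if (batches.size() != 3) throw new AssertionError();    // 2 + 2 + 1 rows
        if (batches.get(2).length != 1) throw new AssertionError();
    }
}
```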


>       3. Minimize use of heap memory - Most of the core existing Arrow Java
>       libraries have been very focused on minimizing on-heap memory
>       consumption. While there may be some, we continue to try to keep the
>       footprint as small as possible. When creating new readers/writers, I
>       think we should target the same standard for new readers. For example,
>       the current Avro reader PR relies heavily on the Java Avro project's
>       reader implementation, which has very poor heap characteristics.
>

Can you clarify the intent of this objective.  Is it mainly to tie in with
the existing Java Arrow memory bookkeeping?  Performance?  Something else?

      4. Industry leading performance - People should expect that using
>       Arrow stuff is very fast. Releasing something under this banner means
>       we should focus on achieving that kind of target. To pick on the Avro
>       reader again here, our previous analysis has shown that the Java Avro
>       project's reader (not the Arrow-connected impl) is frequently an order
>       of magnitude+ slower than some other open-source Avro readers (such as
>       Impala's implementation), especially when applying any predicates or
>       projections.
>

I'm afraid this might lead to a "perfect is the enemy of the good"
situation.  Starting off with a known good implementation of conversion to
Arrow can allow us both to profile hot-spots and provide a comparison of
implementations to verify correctness.

There is also the question of how widely adoptable we want Arrow libraries
to be.
It isn't surprising to me that Impala's Avro reader is an order of
magnitude faster than the stock Java one.  As far as I know Impala's is a
C++ implementation that does JIT with LLVM.  We could try to use it as a
basis for converting to Arrow but I think this might limit adoption in some
circumstances.  Some organizations/people might be hesitant to adopt the
technology due to:
1.  Use of JNI.
2.  Use of LLVM to do JIT.

It seems that as long as we have a reasonably general interface to
data-sources we should be able to optimize/refactor aggressively when
needed.

      5. (Ideally) Predicate application as part of reads - In 99% of the
>       workloads we've seen, a user is applying one or more predicates when
>       reading data. Whatever performance you gain from a strong read
>       implementation will be drowned out in most cases if you fail to apply
>       predicates as part of reading (and thus have to materialize far more
>       records than you'll need).
>

I agree this would probably be useful, and something that should be
considered as part of a generalized reader.  It doesn't seem like it should
necessarily block implementations.  For instance, as far as I know this
isn't implemented in the C++ CSV reader (and I'm pretty sure the other file
format readers we have in C++ don't support it yet either).  Also, as far as I
know Apache Spark treats predicate push-downs on its data-sets as optional.
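[Editor's note: a toy illustration of the predicate-pushdown point above, with hypothetical names rather than any real Arrow API. When the predicate is evaluated inside the scan, rows that fail it are never copied into an output batch at all, instead of being materialized and filtered afterward.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PushdownDemo {
    // Hypothetical: evaluate the predicate while scanning raw values so that
    // only surviving rows are materialized into the output batch.
    static List<Integer> scanWithPredicate(int[] raw, IntPredicate keep) {
        List<Integer> batch = new ArrayList<>();
        for (int v : raw) {
            if (keep.test(v)) {
                batch.add(v);  // rows failing the predicate are never copied
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<Integer> out = scanWithPredicate(new int[]{1, 5, 10, 3, 8}, v -> v > 4);
        if (!out.equals(List.of(5, 10, 8))) throw new AssertionError();
    }
}
```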


>    3. Propose a generalized "reader" interface as opposed to making each
>    reader have a different way to package/integrate.
>

This also seems like a good idea.  Is this something you were thinking of
doing or just a proposal that someone in the community should take up
before we get too many more implementations?
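[Editor's note: to make the shape of such a generalized interface concrete, one hypothetical strawman that folds in the field-selection and batch-size options discussed earlier in the thread. Every name here is illustrative only; this is not a concrete API proposal.]

```java
import java.util.Set;

public class ReaderContractSketch {
    // A batch of Arrow data; stand-in for a real vector container.
    interface ArrowBatch {
        int rowCount();
    }

    // One possible common contract all readers/adapters could implement,
    // instead of each integration inventing its own packaging.
    interface ArrowReader extends AutoCloseable {
        boolean loadNextBatch();   // advance; false when the source is drained
        ArrowBatch currentBatch();
    }

    // Options the thread argues every implementation should honor.
    static final class ReadOptions {
        final Set<String> selectedFields;  // 2.1: column projection
        final int targetBatchSize;         // 2.2: caller-chosen batch size

        ReadOptions(Set<String> selectedFields, int targetBatchSize) {
            this.selectedFields = selectedFields;
            this.targetBatchSize = targetBatchSize;
        }
    }

    public static void main(String[] args) {
        ReadOptions opts = new ReadOptions(Set.of("id", "name"), 4096);
        if (opts.targetBatchSize != 4096) throw new AssertionError();
        if (!opts.selectedFields.contains("id")) throw new AssertionError();
    }
}
```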