Posted to dev@parquet.apache.org by Tim Armstrong <ta...@cloudera.com.INVALID> on 2019/03/08 01:10:07 UTC

Re: Impala/parquet-cpp

Changed the subject to avoid derailing the other thread.

Thanks for challenging me on this, I think this could be a constructive
conversation. I agree that a better parquet-cpp has benefits for the
overall ecosystem. I've had out-of-band conversations with various people,
including Zoltan Ivanfi, about more code sharing, so there's definitely
interest in its benefits.

I'm not sure how viable a big-bang switch to parquet-cpp would be, given
the uncertainty around the performance implications and the invasiveness of
changes that would be required both in Impala and parquet-cpp - there would
be a lot of work required to get to the point of figuring out how close we
are to our goal. It would be pretty painful to do the initial integration
work and then discover that there's another huge tranche of work to do to
get performance up to parity. I think a lot of my concerns are specific to
doing a big-bang change rather than to sharing more code in general.

It would be interesting to talk about whether there's a way we could
incrementally start sharing code - that seems more palatable unless someone
is going to take on a more heroic task. E.g. if parquet-cpp and/or arrow
exposed utilities and subroutines used for decoding, maybe we could use
those and contribute back any improvements we made. I'm unsure whether
these are part of the public API, or could be part of the public API. Or
maybe we could look at the column writers, since the write path in Impala
is a lot less optimised and the bar is lower.
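To make "utilities and subroutines used for decoding" concrete: one of the
innermost hot paths in any Parquet reader is decoding the RLE/bit-packed
hybrid encoding used for definition/repetition levels and dictionary
indices. A minimal self-contained sketch of such a subroutine (this is
illustrative only, not the actual parquet-cpp or Impala code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a ULEB128 varint, advancing *pos.
static uint32_t ReadVarint(const uint8_t* buf, size_t* pos) {
  uint32_t result = 0;
  int shift = 0;
  while (true) {
    uint8_t byte = buf[(*pos)++];
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;
    shift += 7;
  }
  return result;
}

// Decode up to `num_values` values from a Parquet RLE/bit-packed hybrid
// buffer at the given bit width. Each run starts with a varint header:
// LSB == 0 means an RLE run (count = header >> 1, value stored in
// ceil(bit_width / 8) little-endian bytes); LSB == 1 means a bit-packed
// run of (header >> 1) groups of 8 values, packed LSB-first.
void DecodeRleHybrid(const uint8_t* buf, size_t len, int bit_width,
                     size_t num_values, std::vector<uint32_t>* out) {
  size_t pos = 0;
  const int byte_width = (bit_width + 7) / 8;
  while (out->size() < num_values && pos < len) {
    uint32_t header = ReadVarint(buf, &pos);
    if ((header & 1) == 0) {
      // RLE run: one value repeated (header >> 1) times.
      uint32_t count = header >> 1;
      uint32_t value = 0;
      for (int i = 0; i < byte_width; ++i) {
        value |= static_cast<uint32_t>(buf[pos++]) << (8 * i);
      }
      for (uint32_t i = 0; i < count && out->size() < num_values; ++i) {
        out->push_back(value);
      }
    } else {
      // Bit-packed run: (header >> 1) groups of 8 values.
      uint32_t num_groups = header >> 1;
      const uint8_t* run = buf + pos;
      uint64_t bit_pos = 0;
      uint32_t total = num_groups * 8;
      for (uint32_t i = 0; i < total && out->size() < num_values; ++i) {
        uint32_t v = 0;
        for (int b = 0; b < bit_width; ++b) {
          uint64_t p = bit_pos + b;
          v |= static_cast<uint32_t>((run[p / 8] >> (p % 8)) & 1u) << b;
        }
        bit_pos += bit_width;
        out->push_back(v);
      }
      // Each group of 8 values occupies bit_width bytes.
      pos += static_cast<size_t>(num_groups) * bit_width;
    }
  }
}
```

If subroutines at roughly this granularity were exposed as a stable public
API, Impala could call them directly and contribute back any SIMD or
codegen-friendly variants, without either side adopting the other's full
scanner.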

- Tim

On Thu, Mar 7, 2019 at 11:07 AM Wes McKinney <we...@gmail.com> wrote:

> hi Tim,
>
> On Thu, Mar 7, 2019 at 11:52 AM Tim Armstrong
> <ta...@cloudera.com.invalid> wrote:
> >
> > I think you and I have different priors on this, Wes. It's definitely not
> > clear-cut. I think it's an interesting point to discuss and it's
> > unfortunate that you feel that way.
> >
> > Partially the current state of things is due to path-dependence, but
> there
> > are some parts of the Impala runtime that it's important for us to
> > integrate with, e.g. I/O, codegen, memory management, error handling,
> query
> > cancellation, etc. It's hard to integrate with those if we have an
> external
> > library driving the Parquet scanning.
> >
>
> I am confident that these issues could be resolved to mutual
> satisfaction if there was the desire to do so. It would probably
> require some refactoring to externalize some internal interfaces (like
> IO and memory management) in suitable form to plug into the Impala
> runtime, but this would be a good exercise and result in a better
> piece of software.
>
> On the decoding hot path, I do not see why it would be impossible to
> expose the intermediate encoded column chunks to Impala's LLVM IR in
> sufficient form to populate RowBatch tuples with low overhead.
>
> I am the author of the most lines of code in parquet-cpp and it gave
> me no satisfaction to have to copy and paste so much code out of the
> Impala codebase to initially build the project. In part as a result of
> this duplicated effort, there are still parts of parquet-cpp that are
> unfinished, like nested data conversion to and from the Arrow columnar
> format.
>
> > I agree if we think parquet-cpp is completely mutable and we can make
> > whatever tweaks and add whatever hooks or interfaces to it that we need
> to
> > get the behaviour we want, then we could get most of the advantages of
> > tight coupling. But that might end up just being tight coupling with a
> > library boundary in the middle, which is worse in many ways. I'm also not
> > sure that parquet-cpp is that mutable - at some point I think there would
> > be resistance to adding hooks that nobody else wants or refactoring
> things
> > in a way that worked for us. We may also prefer different trade-offs in
> > complexity vs performance to the rest of the community.
> >
>
> The "original sin" here was the failure to develop a C++ library in
> Apache Parquet that catered to the needs of Impala from the get-go. I
> realize this decision predates both of our involvements in either
> project, and so I feel that we've had to make lemonade out of lemons.
> However it's at times like these when we are having to duplicate
> efforts that the community is harmed by what I believe to be a mistake
> of the past.
>
> It would have been more effort to build a general purpose library from
> day one, but it would also have made Apache Parquet more successful by
> becoming more widely adopted earlier in the native-code consuming
> community. I suspect that would have also had indirect benefits for
> Apache Impala by improving the quality of the "funnel" of structured
> data into places where Impala can query it.
>
> - Wes
>
> > There's also the question of whether the hoop-jumping required to do
> > optimisations across library and project boundaries is lower effort than
> > implementing things in the simplest way possible in one place but paying
> > the cost of duplication. I can see why someone might come down on the
> other
> > side on this point.
> >
> > For most open source libraries we use, we'd accept the trade-off of less
> > control for the advantages of having a community around the library. But
> > for stuff that's performance-critical, we're generally willing to invest
> > more time into maintaining it.
> >
> > Interestingly we have a bit of a natural experiment going with the ORC
> > scanner implementation which took the opposite approach. There's a huge
> gap
> > in performance and the memory management is pretty bad (the Parquet
> > implementation can do scans with a fixed memory budget, etc). Some of
> that
> > is just an effort thing but once you dig into it there are some deeper
> > issues that require the kind of hooks that I was talking about. I did a
> > presentation recently that enumerated some of the challenges, might as
> well
> > share:
> >
> https://docs.google.com/presentation/d/1o-vAm933Q1WvApzSnYK0bpIhmJ0w1-9xdaKF2wu-7FQ/edit?usp=sharing
> >
> > Anyway, sorry for the diatribe, but I do find it an interesting
> discussion
> > and am quite opinionated about it.
> >
> > - Tim
> >
> > On Thu, Mar 7, 2019 at 8:04 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > It makes me very sad that Impala has this bespoke Parquet
> > > implementation. I never really understood the benefit of doing the
> > > work there rather than in Apache Parquet. I never found the arguments
> > > "why" I've heard over the years (that the implementation needed to be
> > > tightly coupled to the Impala runtime) to be persuasive. At this point
> > > it is probably too late to do anything about it.
> > >
> > > Thanks
> > >
>