Posted to dev@arrow.apache.org by Andrew Lamb <al...@influxdata.com> on 2022/04/17 10:46:22 UTC
Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

There are a few interesting topics in the sync document [1] that I think
would make for a good sync-up call. For personal reasons I can no longer host such
meetings on Thursdays for a while, but I would like to propose one for next
Wednesday, April 20, 2022 at 15:00 UTC. Please see the document for more
details and to offer comments.

I also want to remind the community that anyone should feel free to
organize meetups on days / timezones that work well for them and publicize
them in the document, in the slack channel and on the mailing list.

Andrew

[1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#

On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman <bo...@gmail.com> wrote:

> I just missed the call, but I watched the recording (thank you to Andrew
> for posting [1]). Really interesting!
> I'm diving into Arrow because I have some previous experience with
> in-memory query engines. I'm following discussions around improving
> performance and adding features so I can determine how best to contribute.
>
> In particular, I was interested in some of the background for the JIT
> implementation [2] and the row format [3] but I guess I'm missing context.
> I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> inherently row-based".
> My questions:
> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?
>   - An example would be nice, like "select a, b, c from blah order by d"
> with table "blah" having 1 million rows and 10 columns takes 5 minutes, or
> even anecdotal evidence like mailing list discussions
> - In general, what tools are you using to analyze datafusion performance?
>   - The criterion benchmarks are nice but do you have anything higher-level
> which exercises a broad range of workloads?
>   - How much profiling have you done to identify bottlenecks?
>
> To be honest, I was kind of surprised to see using a row format to solve a
> performance problem, but I figured you must have good reasons, and I'm
> still getting my brain around datafusion's query execution model. Thanks
> for any illumination!
>
> [1] https://youtu.be/5NJcqXm6uE0
> [2] https://github.com/apache/arrow-datafusion/pull/1849
> [3] https://github.com/apache/arrow-datafusion/pull/1782
> [4] https://github.com/apache/arrow-datafusion/issues/1708
>
> On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > I am not sure if everyone saw it in the agenda[1], but we plan to have a
> > meeting tomorrow. I'll plan to record it for anyone who cannot make this
> > time.
> >
> > 15:00 UTC Wednesday March 9, 2022
> > Meeting Location: (in agenda)
> > Matthew Turner: focused on JIT and row representation, next Wednesday,
> > March 9th
> > @yijie: JIT overview
> >
> > [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> >
> > On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <benson_muite@emailplus.org> wrote:
> >
> > > Interested in learning more about this. Can work through the code and
> > > discuss on 17 March either 4:00 or 16:00 UTC.
> > >
> > > Benson
> > >
> > > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > > through of the JIT code. I would be interested in this as well -- would
> > > > anyone plan to be on the call and discuss it?
> > > >
> > > > I don't think I have time to prepare that content prior to the call.
> > > >
> > > > Andrew
> > > >
> > > > [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > > >
> > >
> >
>