Posted to dev@arrow.apache.org by Ian Cook <ia...@ursacomputing.com> on 2023/04/11 21:35:55 UTC

Arrow community meeting April 12 at 16:00 UTC

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting April 12 at 16:00 UTC

Posted by Gang Wu <us...@gmail.com>.
AFAIK, the Parquet PMC no longer governs parquet-cpp in practice.

We should probably raise the issue on private@parquet.apache.org for a
formal discussion.

Best,
Gang

On Sat, Apr 15, 2023 at 7:52 PM Andrew Lamb <al...@influxdata.com> wrote:

> > Rust Parquet was donated directly to the Arrow project and
> > developed under its auspices after donation.
>
> Yes, this is my recollection as well -- the original implementation I
> believe is [1]
>
> Andrew
>
> [1] https://github.com/sunchao/parquet-rs

Re: Arrow community meeting April 12 at 16:00 UTC

Posted by Andrew Lamb <al...@influxdata.com>.
> Rust Parquet was donated directly to the Arrow project and
> developed under its auspices after donation.

Yes, this is my recollection as well -- the original implementation I
believe is [1]

Andrew

[1] https://github.com/sunchao/parquet-rs

On Fri, Apr 14, 2023 at 10:59 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > - Joris believes we can go ahead and do this; the Parquet Rust
> > implementation did something similar
>
> Small note here, IIRC the origins of the Rust and C++ Parquet code are
> different.  Rust Parquet was donated directly to the Arrow project and
> developed under its auspices after donation.  The parquet-cpp integration at
> the time was done with the agreement that it would still live under the
> governance of the Parquet PMC (with the hope of it getting split out again
> at some point).  I think there has been enough code creep here that, without
> a significant amount of work, separating parquet C++ back out of Arrow
> is likely not tenable.
>
> I pinged the thread again to see if we can get the parquet PMC to weigh in
> here.

Re: Arrow community meeting April 12 at 16:00 UTC

Posted by Micah Kornfield <em...@gmail.com>.
>
> - Joris believes we can go ahead and do this; the Parquet Rust
> implementation did something similar

Small note here, IIRC the origins of the Rust and C++ Parquet code are
different.  Rust Parquet was donated directly to the Arrow project and
developed under its auspices after donation.  The parquet-cpp integration at
the time was done with the agreement that it would still live under the
governance of the Parquet PMC (with the hope of it getting split out again
at some point).  I think there has been enough code creep here that, without
a significant amount of work, separating parquet C++ back out of Arrow
is likely not tenable.

I pinged the thread again to see if we can get the parquet PMC to weigh in
here.



On Wed, Apr 12, 2023 at 12:39 PM Ian Cook <ia...@ursacomputing.com> wrote:

> Below is a summary of the notes from today's meeting:

Re: Arrow community meeting April 12 at 16:00 UTC

Posted by Ian Cook <ia...@ursacomputing.com>.
Below is a summary of the notes from today's meeting:

Attendees:

- Ian Cook
- Raúl Cumplido
- Xuwei Fu
- Will Jones
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Ashish Paliwal
- Dane Pitkin
- David Dali Susanibar Arce
- Matthew Topol
- Joris Van den Bossche
- Jacob Wujciak


Discussion:

12.0.0 release

- Code freeze is scheduled for later today, April 12
- There are many nightly failures currently on main; Raúl and Jacob
have opened several blocker issues and we might need to create more
- Discussion of several current issues that might affect the release
   - C# tests not finding Python
   - PyArrow test slowness on Windows [1]
   - PyArrow wheels on Windows not uploading to Gemfury
- Important items to mention in release changelog, release blog, etc.
  - Drop support for Ubuntu 18.04 [2]
  - Acero refactor (splitting Acero out from core Arrow library) [3]
  - Fixed shape tensor extension type [4]
  - Run-end encoded layout [5]
  - Plasma removal [6] and suggested alternatives [7]
  - Reminder about Jira to GitHub move (which happened just before the
11.0.0 release)
  - Initial Swift implementation [8]
  - nanoarrow (not technically a part of this release, but worth
drawing attention to) [9]
  - Also see ASF board report


Parquet tickets are still tracked in the ASF Jira

- We have to maintain a lot of code in Archery, etc. to automate the
tracking of Parquet C++ issues which are still in Jira, even though
there are only a few Parquet issues in each release (4 for 12.0.0)
  - PARQUET-2201 Add stress test for RecordReader ReadRecords and
SkipRecords. (#14879)
  - PARQUET-2225 Allow reading dense with RecordReader (#17877)
  - PARQUET-2232 Add an api to ColumnChunkMetaData to indicate if the
column chunk uses a bloom filter (#33736)
  - PARQUET-2250 Expose column descriptor through RecordReader (#34318)
- Can we move the Parquet C++ issues from the ASF Jira to GitHub?
- Joris believes we can go ahead and do this; the Parquet Rust
implementation did something similar
- There are already some Parquet issues that were reported and
resolved in the Arrow monorepo in this release without ever being
opened as Parquet Jira issues [10]
- Check with Micah Kornfield, Fatemah Panah
- There was a related Parquet mailing list discussion about this in
February [11]
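
As a rough illustration of the kind of bookkeeping involved (this is
not Archery's actual code), a minimal Python sketch that splits
entries shaped like the four listed above into a Jira key, a title,
and a GitHub number might look like the following. The ChangelogEntry
type, the parse_entry function, and the regular expression are
hypothetical names introduced only for this example.

```python
import re
from typing import NamedTuple, Optional


class ChangelogEntry(NamedTuple):
    jira_key: Optional[str]       # e.g. "PARQUET-2201"; None for GitHub-only entries
    title: str                    # free-text summary of the change
    github_number: Optional[int]  # trailing "(#NNNN)" reference, if present


# Hypothetical pattern for entries shaped like:
#   "PARQUET-2250 Expose column descriptor through RecordReader (#34318)"
# Both the Jira key and the "(#NNNN)" suffix are treated as optional.
_ENTRY = re.compile(
    r"^(?:(?P<jira>PARQUET-\d+)\s+)?(?P<title>.*?)\s*(?:\(#(?P<gh>\d+)\))?$"
)


def parse_entry(line: str) -> ChangelogEntry:
    """Split one single-line changelog entry into its three parts."""
    match = _ENTRY.match(line.strip())
    gh = match.group("gh")
    return ChangelogEntry(
        jira_key=match.group("jira"),
        title=match.group("title"),
        github_number=int(gh) if gh else None,
    )
```

An entry with no PARQUET-NNNN prefix parses with jira_key set to None,
which is the case that would become the norm if the Parquet C++
issues moved to GitHub.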


[1] https://github.com/apache/arrow/issues/35078
[2] https://github.com/apache/arrow/issues/33800
[3] https://lists.apache.org/thread/5h5g9k9lvbybzl8fnbg4fppxczm42g6r
[4] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor
[5] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
[6] https://github.com/apache/arrow/pull/34718
[7] https://lists.apache.org/thread/lk277x3b9gjol42sjg27bst2ggm5s0j2
[8] https://github.com/apache/arrow/issues/20484
[9] https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/
[10] https://github.com/apache/arrow/issues?q=is%3Aissue+label%3A%22Component%3A+Parquet%22+is%3Aclosed
[11] https://lists.apache.org/thread/jf9wos3t6xxk6xdyx2dof1jlkbpkr56p