Posted to dev@arrow.apache.org by Adesola Adedewe <av...@g.rit.edu> on 2023/01/19 03:04:53 UTC

New Pandas-Apache repo

https://github.com/ava6969/panda-arrow.git

Please review this new repo I made, which provides nice wrappers over
Apache Arrow using pandas-style interfaces. I'd love to add it to the
main repo.

Re: New Pandas-Apache repo

Posted by Adesola Adedewe <av...@g.rit.edu>.

Thanks for taking the time to review. It works great for my use case,
as I am doing one-shot loading and data manipulation on big data, and the
data is basically immutable (read-only) for the rest of the lifetime of
the process. But I know from testing and benchmarking that it is limited
by the fact that we can't easily perform in-place operations. I'm still a
novice at the kernel portion of Arrow, so I'm sure there are opportunities
to improve performance.

On Fri, Jan 27, 2023 at 9:53 AM Weston Pace <we...@gmail.com> wrote:

> The new kernels are interesting.  There has been some ask recently[1]
> for weighted averages and I think you have some of the pieces (if not
> all of it) here.  We also recently plumbed in support for binary
> aggregates into Acero[2] so having more binary aggregate kernels would
> be nice.
>
> Outside of the kernels I agree with Kou that this probably doesn't
> need to be a part of the main repo.  There is already some discussion
> of splitting the main repo itself[3] in the interest of having smaller
> composable pieces over monolithic pieces.
>
> We do want tools like this to exist and flourish so I think Benson's
> idea of a blog post is nice.  We also have a "powered by" page[4].
>
> As for the library itself:
>
> This appears to be a fairly faithful reproduction of the pandas API
> and I imagine would be quite friendly to those coming from python.
> Since you are using compute kernels directly you are going to be
> limited to operating on what you can fit in memory (though I'm sure
> there are plenty of valid use cases in this space).  I think the
> primary challenge you will encounter in seeking users will be that the
> audience of C++ data scientists is pretty small.  You aren't, for
> example, going to get a significant performance boost over pandas /
> numpy (as they use C/C++ for the heavy lifting already) and so the
> real benefit will only be for those that are stuck using C++ already.
>
> [1] https://github.com/apache/arrow/issues/15103
> [2] https://github.com/apache/arrow/pull/15083
> [3] https://github.com/apache/arrow/issues/15280
> [4] https://arrow.apache.org/powered_by/

Re: New Pandas-Apache repo

Posted by Weston Pace <we...@gmail.com>.

The new kernels are interesting.  There has been some ask recently[1]
for weighted averages and I think you have some of the pieces (if not
all of it) here.  We also recently plumbed in support for binary
aggregates into Acero[2] so having more binary aggregate kernels would
be nice.

Outside of the kernels I agree with Kou that this probably doesn't
need to be a part of the main repo.  There is already some discussion
of splitting the main repo itself[3] in the interest of having smaller
composable pieces over monolithic pieces.

We do want tools like this to exist and flourish so I think Benson's
idea of a blog post is nice.  We also have a "powered by" page[4].

As for the library itself:

This appears to be a fairly faithful reproduction of the pandas API
and I imagine would be quite friendly to those coming from python.
Since you are using compute kernels directly you are going to be
limited to operating on what you can fit in memory (though I'm sure
there are plenty of valid use cases in this space).  I think the
primary challenge you will encounter in seeking users will be that the
audience of C++ data scientists is pretty small.  You aren't, for
example, going to get a significant performance boost over pandas /
numpy (as they use C/C++ for the heavy lifting already) and so the
real benefit will only be for those that are stuck using C++ already.

[1] https://github.com/apache/arrow/issues/15103
[2] https://github.com/apache/arrow/pull/15083
[3] https://github.com/apache/arrow/issues/15280
[4] https://arrow.apache.org/powered_by/

On Sun, Jan 22, 2023 at 2:44 AM Adesola Adedewe <av...@g.rit.edu> wrote:
>
> Yes, I will. I haven't taken enough time to clean up the README; it was
> generated from my source code with ChatGPT. I will do that later in the
> week.

Re: New Pandas-Apache repo

Posted by Adesola Adedewe <av...@g.rit.edu>.

Yes, I will. I haven't taken enough time to clean up the README; it was
generated from my source code with ChatGPT. I will do that later in the
week.

On Sun, Jan 22, 2023 at 2:36 AM Benson Muite <be...@emailplus.org>
wrote:

> The context is very helpful. A blog post would certainly alert others in
> the Arrow community to your work.  Most developers are overburdened, so
> explaining a use case and how it may help them would encourage
> exploration and review of your repository. I would therefore encourage a
> blog post that alerts the wider Arrow developer community to your work.
> Updating the README of your repository would also encourage use.
>
>

Re: New Pandas-Apache repo

Posted by Benson Muite <be...@emailplus.org>.

On 1/22/23 13:15, Adesola Adedewe wrote:
> I'm working on a project where big financial data needs to be loaded,
> stored, and manipulated. The data is stored as Parquet. My initial
> version had Arrow just load the Parquet data, and I used a basic
> unordered_map, but this limited me to only one data type. I found I could
> make my database more generic with Arrow and get its performance
> benefits. Unfortunately, my team is mostly Python developers, so I
> decided to write a cleaner interface over Arrow, using interfaces closer
> to pandas. This enabled us to use fewer lines of code as well and still
> enjoy the benefits. I will write a blog post later; I was mostly looking
> for other developers who want to collaborate or who may need this as
> well. I'm not necessarily looking to add it to the main library, but I'm
> not opposed to that. I also implemented some custom kernels such as
> covariance, correlation, cumprod, shift, and pct_change.
> 

The context is very helpful. A blog post would certainly alert others in
the Arrow community to your work.  Most developers are overburdened, so
explaining a use case and how it may help them would encourage
exploration and review of your repository. I would therefore encourage a
blog post that alerts the wider Arrow developer community to your work.
Updating the README of your repository would also encourage use.

Re: New Pandas-Apache repo

Posted by Adesola Adedewe <av...@g.rit.edu>.

I'm working on a project where big financial data needs to be loaded,
stored, and manipulated. The data is stored as Parquet. My initial
version had Arrow just load the Parquet data, and I used a basic
unordered_map, but this limited me to only one data type. I found I could
make my database more generic with Arrow and get its performance
benefits. Unfortunately, my team is mostly Python developers, so I
decided to write a cleaner interface over Arrow, using interfaces closer
to pandas. This enabled us to use fewer lines of code as well and still
enjoy the benefits. I will write a blog post later; I was mostly looking
for other developers who want to collaborate or who may need this as
well. I'm not necessarily looking to add it to the main library, but I'm
not opposed to that. I also implemented some custom kernels such as
covariance, correlation, cumprod, shift, and pct_change.
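
For readers less familiar with the pandas-style operations named above
(shift, percent change, cumulative product), here is a rough pure-Python
sketch of what they compute on a single column. It is only an
illustration of the semantics and is not code from the panda-arrow repo
or from Arrow's compute kernels; the function names and the use of None
for missing values are assumptions made for this example.

```python
from itertools import accumulate
from operator import mul
from typing import List, Optional


def shift(values: List[float], periods: int = 1) -> List[Optional[float]]:
    """Move values forward by `periods` slots, padding the front with None.

    Hypothetical helper: only non-negative `periods` are handled here.
    """
    if periods < 0:
        raise ValueError("this sketch only handles periods >= 0")
    padding: List[Optional[float]] = [None] * min(periods, len(values))
    return padding + list(values[: len(values) - len(padding)])


def pct_change(values: List[float]) -> List[Optional[float]]:
    """Fractional change from the previous element; the first entry is None.

    A previous value of zero raises ZeroDivisionError in this sketch.
    """
    previous = shift(values, 1)
    return [
        None if prev is None else (cur - prev) / prev
        for cur, prev in zip(values, previous)
    ]


def cumprod(values: List[float]) -> List[float]:
    """Running product of the elements seen so far."""
    return list(accumulate(values, mul))


prices = [100.0, 110.0, 99.0, 99.0]
shifted = shift(prices)          # previous-row view of the column
returns = pct_change(prices)     # period-over-period fractional change
growth = cumprod([1.0, 2.0, 3.0])
```

A columnar implementation would typically operate on whole arrays with
validity bitmaps rather than Python lists, but the results per element
are meant to line up with the sketch.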

On Sun, Jan 22, 2023 at 1:56 AM Benson Muite <be...@emailplus.org>
wrote:

> Thanks.  What problem did this solve for you?  How did you utilize it
> for your project?  Maybe you could contribute a blog post to Arrow
> describing the end use case and the motivation for a C++ dataframe
> interface?
>
>

Re: New Pandas-Apache repo

Posted by Benson Muite <be...@emailplus.org>.

On 1/22/23 11:41, Adesola Adedewe wrote:
> The project was initially meant to provide a simpler interface over
> Apache Arrow, much like what was done with the Python API, but it has
> evolved to be more than that, with indexing and other pandas operations
> such as reindex, resample, and concat implemented. It is currently good
> enough for my project, but I think it has the potential to open the door
> for more developers to use Arrow in their projects. Please take a look.
> 

Thanks.  What problem did this solve for you?  How did you utilize it
for your project?  Maybe you could contribute a blog post to Arrow
describing the end use case and the motivation for a C++ dataframe
interface?

Re: New Pandas-Apache repo

Posted by Adesola Adedewe <av...@g.rit.edu>.

The project was initially meant to provide a simpler interface over
Apache Arrow, much like what was done with the Python API, but it has
evolved to be more than that, with indexing and other pandas operations
such as reindex, resample, and concat implemented. It is currently good
enough for my project, but I think it has the potential to open the door
for more developers to use Arrow in their projects. Please take a look.

On Sun, Jan 22, 2023 at 12:22 AM Benson Muite <be...@emailplus.org>
wrote:

> How would this compare to [1] and [2]?
>
> [1] https://github.com/xtensor-stack/xframe
> [2] https://github.com/hosseinmoein/DataFrame

Re: New Pandas-Apache repo

Posted by Benson Muite <be...@emailplus.org>.

On 1/22/23 06:23, Adesola Adedewe wrote:
> okay thanks for your consideration.
> 
> On Sat, Jan 21, 2023 at 4:49 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> 
>> Hi,
>>
>> I'm not sure a pandas-like API is suitable for our official
>> data frame API.
>>
>> FYI:
>>
>> * GitHub issue of this:
>>   https://github.com/apache/arrow/issues/33747
>> * [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
>>   https://lists.apache.org/thread/50vbmw49w83sj3km326srown64c7hlf1
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <CA...@mail.gmail.com>
>>   "New Pandas-Apache repo" on Wed, 18 Jan 2023 19:04:53 -0800,
>>   Adesola Adedewe <av...@g.rit.edu> wrote:
>>
>>> https://github.com/ava6969/panda-arrow.git
>>>
How would this compare to [1] and [2]?

[1] https://github.com/xtensor-stack/xframe
[2] https://github.com/hosseinmoein/DataFrame

>>> Please review this new repo I made, which provides nice wrappers over
>>> Apache Arrow using pandas-style interfaces. I'd love to add it to the
>>> main repo.
>>
>

Re: New Pandas-Apache repo

Posted by Adesola Adedewe <av...@g.rit.edu>.

okay thanks for your consideration.

On Sat, Jan 21, 2023 at 4:49 PM Sutou Kouhei <ko...@clear-code.com> wrote:

> Hi,
>
> I'm not sure a pandas-like API is suitable for our official
> data frame API.
>
> FYI:
>
> * GitHub issue of this:
>   https://github.com/apache/arrow/issues/33747
> * [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
>   https://lists.apache.org/thread/50vbmw49w83sj3km326srown64c7hlf1
>
>
> Thanks,
> --
> kou

Re: New Pandas-Apache repo

Posted by Sutou Kouhei <ko...@clear-code.com>.

Hi,

I'm not sure a pandas-like API is suitable for our official
data frame API.

FYI:

* GitHub issue of this:
  https://github.com/apache/arrow/issues/33747
* [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
  https://lists.apache.org/thread/50vbmw49w83sj3km326srown64c7hlf1


Thanks,
-- 
kou

In <CA...@mail.gmail.com>
  "New Pandas-Apache repo" on Wed, 18 Jan 2023 19:04:53 -0800,
  Adesola Adedewe <av...@g.rit.edu> wrote:

> https://github.com/ava6969/panda-arrow.git
> 
> Please review this new repo I made, which provides nice wrappers over
> Apache Arrow using pandas-style interfaces. I'd love to add it to the
> main repo.