Posted to user@arrow.apache.org by "Nugent, Daniel" <Da...@mlp.com> on 2020/03/11 23:06:26 UTC

RE: [EXTERNAL] Re: Question about memoryviews and array construction

Thanks for closing this out!

Sorry I didn't get around to working on this before you ended up putting it in. I had some difficulty getting the dev environment set up and limited time to work on it.

Is there a list of good first issues to take a crack at? I've really appreciated the project overall and would like to help out in the time I can.

-Dan Nugent

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Saturday, March 7, 2020 10:55 AM
To: user@arrow.apache.org
Subject: [EXTERNAL] Re: Question about memoryviews and array construction

There are a couple of places to start:

* Add a PyMemoryView type check to internal::IsPyBinary:
  https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/helpers.h#L80
  I think this is all that's needed to take care of type inference.
* Make sure PyMemoryView is handled in the PyBytesView helper in
  https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L193
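
For readers following along in Python rather than C++, the two changes listed above can be sketched with the standard library alone. This is only an illustration of the idea, not the Arrow implementation: is_binary_like and as_byte_view are hypothetical names standing in for the roles of internal::IsPyBinary and PyBytesView, and the real code works at the C API level.

```python
def is_binary_like(obj):
    """Hypothetical analog of a binary type check that also accepts memoryview."""
    return isinstance(obj, (bytes, bytearray, memoryview))


def as_byte_view(obj):
    """Return a flat view of obj's bytes without copying them.

    Hypothetical helper for illustration; raises TypeError for objects
    that are not binary-like.
    """
    if not is_binary_like(obj):
        raise TypeError(
            f"Expected a bytes-like object, got a '{type(obj).__name__}' object"
        )
    # memoryview() shares the underlying buffer instead of copying it.
    # cast("B") flattens the view to unsigned bytes and raises TypeError
    # if the buffer is not C-contiguous.
    return memoryview(obj).cast("B")
```

Because the returned view shares memory with its source, a later change to the source buffer is visible through the view, which is the property that makes a copy-free conversion possible in the first place.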

On Sat, Mar 7, 2020 at 9:35 AM Daniel Nugent <nu...@gmail.com> wrote:
>
> Great!
>
> If you could provide a smidgen of guidance about where to start making this change, I would be happy to give it a shot.
>
> Thanks,
>
> -Dan Nugent
> On Mar 7, 2020, 09:18 -0500, Wes McKinney <we...@gmail.com>, wrote:
>
> hi Dan,
>
> Yes, we should support constructing StringArray directly from 
> memoryview as we do with bytes and unicode -- you're the first person 
> to ask about this so far. I opened 
> https://issues.apache.org/jira/browse/ARROW-8026. This should not be a
> huge amount of work, so it would be a good first contribution to the
> project.
>
> Thanks
>
> Wes
>
> On Fri, Mar 6, 2020 at 8:29 PM Nugent, Daniel <Da...@mlp.com> wrote:
>
>
> Hi,
>
>
>
> I have a short program and I’m wondering whether it is sensible. Could anyone let me know whether this is reasonable:
>
>
>
> import pyarrow as pa, third_party_library
>
> memory_views = third_party_library.get_strings()
>
> memory_views
> [<memory at 0x7f1745cc0870>, <memory at 0x7f1745cc0940>, <memory at 0x7f1745cc0a10>, <memory at 0x7f1745cc0ae0>]
>
> pa.array(memory_views,pa.string())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: Expected a string or bytes object, got a 'memoryview' object
>
> pa.array(map(bytes,memory_views),pa.string())
> <pyarrow.lib.StringArray object at 0x7f1745cbdd00>
> [
>   "this",
>   "is",
>   "a",
>   "sample"
> ]
>
>
>
> I have a big list of byte sequences being provided to me as memoryviews from a third party library. I’d like to create an Arrow StringArray from them as efficiently as possible. Having to map them through the bytes constructor, and consequently copy each one, seems wasteful (and the memoryview tobytes function appears to just call the bytes constructor, afaict).
>
>
>
> To me, it seemed like pa.array should be able to use the memoryview objects directly to construct the StringArray, but Arrow seems to want them copied into fresh bytes objects first. I don’t understand why, and was ultimately wondering whether this is a reasonable thing to want.
>
>
>
> Thanks in advance,
>
> -Dan Nugent
>
>
>
>
> ######################################################################
>
> The information contained in this communication is confidential and
>
> may contain information that is privileged or exempt from disclosure
>
> under applicable law. If you are not a named addressee, please notify
>
> the sender immediately and delete this email from your system.
>
> If you have received this communication, and are not a named
>
> recipient, you are hereby notified that any dissemination,
>
> distribution or copying of this communication is strictly prohibited.
>
> ######################################################################
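
The extra copy that the original question worries about can be observed with the standard library alone. The sketch below is not tied to Arrow, and backing is a made-up stand-in for memory owned by the third party library. It shows that a memoryview shares the underlying buffer, while both the bytes constructor and tobytes() produce independent copies.

```python
# Hypothetical buffer standing in for memory owned by a third party library.
backing = bytearray(b"this is a sample")

# Slicing a memoryview does not copy; the slice still points into backing.
view = memoryview(backing)[0:4]

# Both of these allocate a new bytes object and copy the four bytes into it.
copied = bytes(view)
also_copied = view.tobytes()

# Mutate the original buffer after the copies were taken.
backing[0] = ord("T")

# The view observes the change; the copies do not.
shares_memory = view[0] == ord("T")
copy_is_independent = copied == b"this" and also_copied == b"this"
print(shares_memory, copy_is_independent)
```

For a large list of memoryviews, mapping each one through bytes therefore duplicates every payload once before pa.array copies it again into the array's own buffers.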



Re: [EXTERNAL] Re: Question about memoryviews and array construction

Posted by Wes McKinney <we...@gmail.com>.
The one you just opened seems like a good first issue:

https://issues.apache.org/jira/browse/ARROW-8070

If you follow the instructions in
https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst
and can't get things to build, please let us know the details so we can
help you.
