Posted to dev@arrow.apache.org by Eli <h5...@protonmail.ch> on 2018/02/01 13:27:57 UTC

Re: How to get "standard" binary columns out of a pyarrow table

Hey Wes,

I understand there's another pointer, a definition level pointer, which is basically a null location marker column. Exposing it as well to pick out the nulls would be awesome. 

The types of interest (to me) are varchars/strings, bools and numbers, just basic primitive types that also exist in standard SQL, so having these two columns available via Python would be sweet.


Thanks,
Eli

Sent with ProtonMail Secure Email.

-------- Original Message --------
 On January 31, 2018 4:06 PM, Wes McKinney  wrote:

>hi Eli,
>
> This isn't available at the moment, but one could make the internal
> buffers in an array accessible in Python. How would you handle nulls
> in this scenario (the bytes for a null value in a primitive array can
> be any value)? How would one handle things other than numbers?
>
> - Wes
>
> On Wed, Jan 31, 2018 at 5:14 AM, Eli h5rdly@protonmail.ch wrote:
>
>>Hey Wes,
>>What I meant by "standard" is the binary representation of a specific type aggregated together.
>>The int32 column [1,2,3] would make '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>>This is already available via Python's struct.pack(), array.array().tostring() or np.array().astype().tobytes()
>>What I was wondering is whether that specific representation is already there somewhere in Arrow's C++ internals, and whether one can get hold of it from PyArrow.
>>I don't know C++ very well, but I think what I'm looking for is in buffer.h, there are pointers to types under Buffer which I think point to just that.
>>I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even has a to_pybytes() method.
>>However:
>> - I'm not sure those are the bytes that I speak of
>>
>> - I'm not sure how to use Buffer to find out, keep getting core dumps when trying
>>-------- Original Message --------
>> On January 10, 2018 7:34 PM, Wes McKinney  wrote:
>>>hi Eli,
>>>I am not aware of any standards for binary columns (or at least, I
>>> don't know what "regular" means in this context) -- part of the
>>> purpose of the Apache Arrow project is to define a columnar standard
>>> in the absence of any existing one. Most database systems define their
>>> own custom wire protocols.
>>>Do you have a link to the specification for the binary protocol for
>>> the database you are using (or some other documentation)?
>>>Thanks,
>>> Wes
>>>On Wed, Jan 10, 2018 at 12:47 AM, Eli h5rdly@protonmail.ch wrote:
>>>>Hey Wes,
>>>> The database in question accepts columnar chunks of "regular" binary data over the network, one of the sources of which is parquet.
>>>> Thus, data only comes out of parquet on my side, and I was wondering how to get it out as "regular" binary columns. Something like tobytes() for an Arrow Column, or maybe read_asbytes() for pa itself. The purpose is to get to standard binary columns as fast as possible.
>>>> Thanks,
>>>> Eli
>>>>>-------- Original Message --------
>>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>>> Local Time: January 10, 2018 5:32 AM
>>>>> UTC Time: January 10, 2018 3:32 AM
>>>>> From: wesmckinn@gmail.com
>>>>> To: dev@arrow.apache.org, Eli h5rdly@protonmail.ch
>>>>> hi Eli,
>>>>> I'm wondering what kind of API you would want, if the perfect one
>>>>> existed. If I understand correctly, you are embedding objects in a
>>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>>> the data goes in / comes out of Parquet?
>>>>> Thanks,
>>>>> Wes
>>>>> On Sat, Jan 6, 2018 at 8:37 AM, Eli h5rdly@protonmail.ch wrote:
>>>>>>Hi,
>>>>>> I'm looking to send "regular" columnar binary data to a database, the kind that gets created by struct.pack, array.array, numpy.tobytes or str.encode.
>>>>>> The origin is parquet files, which I'm reading ever so comfortably via PyArrow.
>>>>>> I do however need to deserialize to Python objects, currently via to_pandas(), then re-serialize the columns with one of the above.
>>>>>> I was wondering whether there was a better way to go about it, one that would be fast and effective.
>>>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if necessary.
>>>>>> I posted the question on stackoverflow, and was asked to post here. Appreciate any feedback!
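The routes listed above all bottom out in the same bytes. A stdlib-only sketch (assuming a little-endian host; note that array.array's tostring() became tobytes() in Python 3 and the old name was removed in 3.9):

```python
import array
import struct

values = [1, 2, 3]

# Each route yields the same 12-byte "regular" binary column
# (4 bytes per int32, little-endian on typical hosts).
via_struct = struct.pack('<3i', *values)
via_array = array.array('i', values).tobytes()  # tostring() in older Pythons

print(via_struct == via_array)  # True on little-endian platforms
print(via_struct)
```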
>>>>>> Thanks,
>>>>>> Eli
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to get "standard" binary columns out of a pyarrow table

Posted by Eli <h5...@protonmail.ch>.
Can I perhaps assist? If I can get a bit more specifics on what needs to be done, I think I can help. I'm OK with Cython, looking at some C++ code, etc.




Re: How to get "standard" binary columns out of a pyarrow table

Posted by Wes McKinney <we...@gmail.com>.
I opened https://issues.apache.org/jira/browse/ARROW-2068, which may
help. This is an accessible issue for someone in the community to work
on; I'm not sure when I'll be able to get to it.

Thanks
Wes
