Posted to dev@arrow.apache.org by Eli <h5...@protonmail.ch> on 2018/02/01 13:27:57 UTC

Re: How to get "standard" binary columns out of a pyarrow table

Hey Wes,

I understand there's another pointer, a definition level pointer, which is basically a null location marker column. Exposing it as well to pick out the nulls would be awesome. 

The types of interest (to me) are varchars/strings, bools and numbers, just basic primitive types that also exist in standard SQL, so having these two columns available via Python would be sweet.


Thanks,
Eli

Sent with ProtonMail Secure Email.

-------- Original Message --------
 On January 31, 2018 4:06 PM, Wes McKinney  wrote:

>hi Eli,
>
> This isn't available at the moment, but one could make the internal
> buffers in an array accessible in Python. How would you handle nulls
> in this scenario (the bytes for a null value in a primitive array can
> be any value)? How would one handle things other than numbers?
>
> - Wes
>
> On Wed, Jan 31, 2018 at 5:14 AM, Eli h5rdly@protonmail.ch wrote:
>
>>Hey Wes,
>>What I meant by "standard" is the binary representation of a specific type aggregated together.
>>The int32 column [1,2,3] would make '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>>This is already available via Python's struct.pack(), array.array().tostring() or np.array().astype().tobytes()
>>What I was wondering is whether that specific representation is already there somewhere in Arrow's C++ internals, and whether one can get hold of it from PyArrow.
>>I don't know C++ very well, but I think what I'm looking for is in buffer.h, there are pointers to types under Buffer which I think point to just that.
>>I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even has a to_pybytes() method.
>>However:
>> - I'm not sure those are the bytes that I speak of
>>
>> - I'm not sure how to use Buffer to find out, keep getting core dumps when trying
>>-------- Original Message --------
>> On January 10, 2018 7:34 PM, Wes McKinney  wrote:
>>>hi Eli,
>>>I am not aware of any standards for binary columns (or at least, I
>>> don't know what "regular" means in this context) -- part of the
>>> purpose of the Apache Arrow project is to define a columnar standard
>>> in the absence of any existing one. Most database systems define their
>>> own custom wire protocols.
>>>Do you have a link to the specification for the binary protocol for
>>> the database you are using (or some other documentation)?
>>>Thanks,
>>> Wes
>>>On Wed, Jan 10, 2018 at 12:47 AM, Eli h5rdly@protonmail.ch wrote:
>>>>Hey Wes,
>>>> The database in question accepts columnar chunks of "regular" binary data over the network, one of the sources of which is parquet.
>>>> Thus, data only comes out of parquet on my side, and I was wondering how to get it out as "regular" binary columns. Something like tobytes() for an Arrow Column, or maybe read_asbytes() for pa itself. The purpose is to get to standard binary columns as fast as possible.
>>>> Thanks,
>>>> Eli
>>>>>-------- Original Message --------
>>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>>> Local Time: January 10, 2018 5:32 AM
>>>>> UTC Time: January 10, 2018 3:32 AM
>>>>> From: wesmckinn@gmail.com
>>>>> To: dev@arrow.apache.org, Eli h5rdly@protonmail.ch
>>>>> hi Eli,
>>>>> I'm wondering what kind of API you would want, if the perfect one
>>>>> existed. If I understand correctly, you are embedding objects in a
>>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>>> the data goes in / comes out of Parquet?
>>>>> Thanks,
>>>>> Wes
>>>>> On Sat, Jan 6, 2018 at 8:37 AM, Eli h5rdly@protonmail.ch wrote:
>>>>>>Hi,
>>>>>> I'm looking to send "regular" columnar binary data to a database, the kind that gets created by struct.pack, array.array, numpy.tobytes or str.encode.
>>>>>> The origin is parquet files, which I'm reading ever so comfortably via PyArrow.
>>>>>> I do however need to deserialize to Python objects, currently via to_pandas(), then re-serialize the columns with one of the above.
>>>>>> I was wondering whether there was a better way to go about it, one that would be fast and effective.
>>>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if necessary.
>>>>>> I posted the question on stackoverflow, and was asked to post here. Appreciate any feedback!
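The routes listed above all bottom out in the same bytes. A stdlib-only sketch (assuming a little-endian host; note that array.array's tostring() became tobytes() in Python 3 and the old name was removed in 3.9):

```python
import array
import struct

values = [1, 2, 3]

# Each route yields the same 12-byte "regular" binary column
# (4 bytes per int32, little-endian on typical hosts).
via_struct = struct.pack('<3i', *values)
via_array = array.array('i', values).tobytes()  # tostring() in older Pythons

print(via_struct == via_array)  # True on little-endian platforms
print(via_struct)
```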
>>>>>> Thanks,
>>>>>> Eli
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to get "standard" binary columns out of a pyarrow table

Posted by Eli <h5...@protonmail.ch>.
Can I perhaps assist? If I can get a bit more specifics on what needs to be done, I think I can help. I'm OK with Cython, looking at some C++ code, etc.




Re: How to get "standard" binary columns out of a pyarrow table

Posted by Wes McKinney <we...@gmail.com>.
I opened https://issues.apache.org/jira/browse/ARROW-2068, which may
help. This is an accessible issue for someone in the community to work
on; I'm not sure when I'll be able to get to it.

Thanks
Wes
