Posted to user@arrow.apache.org by Luke <vi...@gmail.com> on 2018/10/11 18:01:03 UTC

parquet file in S3, is there a way to read a subset of all the columns in python

I have parquet files (each self-contained) in S3 and I want to read certain
columns into a pandas dataframe without reading the entire object out of
S3.

Is this implemented? boto3 in Python supports reading from offsets in an
S3 object, but I wasn't sure whether anyone has made that work with a
parquet file, reading only the byte ranges corresponding to certain columns.

thanks,
Luke

Re: parquet file in S3, is there a way to read a subset of all the columns in python

Posted by Wes McKinney <we...@gmail.com>.
You should be able to use s3fs, both via the file handles it creates and
as a filesystem for reading multi-file datasets; see this test for an
example:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1441
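
For illustration, a minimal sketch of that approach (the bucket, key, and
column names are hypothetical; it assumes pyarrow and s3fs are installed):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Pass an open s3fs file handle; pyarrow fetches only the byte
# ranges needed for the requested columns.
with fs.open('mybucketfoo/foo.parquet', 'rb') as f:
    table = pq.read_table(f, columns=['col_a', 'col_b'])

# Or pass the filesystem itself to read a multi-file dataset.
dataset = pq.ParquetDataset('mybucketfoo/dataset/', filesystem=fs)
table = dataset.read(columns=['col_a', 'col_b'])

df = table.to_pandas()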

Re: parquet file in S3, is there a way to read a subset of all the columns in python

Posted by Luke <vi...@gmail.com>.
It looks like https://github.com/dask/s3fs implements these methods. Would
there need to be a wrapper over this for Arrow, or is it compatible as-is?

-Luke
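
As a quick check of that compatibility (bucket and key hypothetical),
s3fs file objects already expose the read/seek interface pyarrow
expects, so no extra wrapper should be needed:

import s3fs

fs = s3fs.S3FileSystem()
f = fs.open('mybucketfoo/foo.parquet', 'rb')
# s3fs files behave like ordinary Python file objects.
print(hasattr(f, 'read'), hasattr(f, 'seek'))  # True True
f.close()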


Re: parquet file in S3, is there a way to read a subset of all the columns in python

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
That looks nice. Once you have wrapped that in a class that implements
read and seek like a Python file object, you should be able to pass it
to `pyarrow.parquet.read_table`. When you set the columns argument on
that function, only the byte ranges for those columns are requested
from S3. To minimise the number of requests, I would suggest
implementing the S3 file so that it fetches exactly the ranges it is
asked for, but wrapping it in an io.BufferedReader before handing it to
pyarrow. pyarrow.parquet requests exactly the ranges it needs, which
can sometimes be too fine-grained for object stores like S3; there you
usually want to trade requesting a few extra bytes for a smaller number
of requests.
Uwe
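
A hedged sketch of what Uwe describes (the class name, bucket, key, and
buffer size are illustrative, not from the thread): a raw, seekable,
read-only file over an S3 object that fetches exactly the ranges asked
of it, wrapped in an io.BufferedReader so that pyarrow's many small
reads are coalesced into fewer, larger S3 requests:

import io
import boto3
import pyarrow.parquet as pq

class S3RawFile(io.RawIOBase):
    # Minimal seekable, read-only file over a single S3 object.
    def __init__(self, bucket, key):
        self._obj = boto3.resource('s3').Object(bucket, key)
        self._size = self._obj.content_length
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        if self._pos >= self._size:
            return 0  # EOF
        end = min(self._pos + len(b), self._size) - 1
        # One ranged GET per raw read; the BufferedReader below keeps
        # these reads large and infrequent.
        data = self._obj.get(
            Range='bytes=%d-%d' % (self._pos, end))['Body'].read()
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

# Trade a few extra bytes per request for far fewer requests.
f = io.BufferedReader(S3RawFile('mybucketfoo', 'foo.parquet'),
                      buffer_size=128 * 1024)
table = pq.read_table(f, columns=['col_a', 'col_b'])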


Re: parquet file in S3, is there a way to read a subset of all the columns in python

Posted by Luke <vi...@gmail.com>.
This works in boto3:

import boto3

# Fetch only bytes 10-100 of the object via an HTTP Range request.
obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
stream = obj.get(Range='bytes=10-100')['Body']
print(stream.read())



Re: parquet file in S3, is there a way to read a subset of all the columns in python

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Luke,

this is only partly implemented. You can do it, and I have already done
it myself, but it is sadly not in a perfect state.

boto3 itself seems to be lacking a proper file-like class. You can get
the contents of a file in S3 as a
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
, but this sadly seems to be missing a seek method.

In my case I accessed parquet files on S3 with per-column reads using
the simplekv project. It implements a small file-like class on top of
boto (but not boto3):
https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
. This is essentially what you are looking for, just for the wrong boto
package; also, as far as I know this implementation sadly leaks HTTP
connections, so when you access too many files (even serially), your
network will suffer.
Cheers
Uwe
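
A small demonstration of the limitation Uwe describes (bucket and key
hypothetical): botocore's StreamingBody can be read sequentially but
offers no seek, so it cannot be handed to pyarrow.parquet directly:

import boto3

# StreamingBody supports sequential read() only; at the time of this
# thread it had no seek(), so random access into the object requires
# issuing fresh ranged GETs instead.
body = boto3.resource('s3').Object('mybucketfoo', 'foo').get()['Body']
print(body.read(16))  # streams the first 16 bytes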

