Posted to user@arrow.apache.org by Cindy McMullen <cm...@twitter.com> on 2021/12/23 05:16:58 UTC

ParquetFile API and GCS file

Hi -

I need to drop down to the ParquetFile API so I can have better control
over batch size for reading huge Parquet files.  The filename is:

*gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy*

This invocation fails:
*pqf = pq.ParquetFile(filename)*
"FileNotFoundError: [Errno 2] Failed to open local file
'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
Detail: [errno 2] No such file or directory"

While this call, using the same filename, succeeds because I can specify the 'gs'
filesystem:
*table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)*
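(Spelled out with the imports it needs, that working call looks roughly like this; a sketch assuming the gcsfs package supplies the 'gs' filesystem object, with an illustrative path:)

import gcsfs
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()   # fsspec-based GCS filesystem; credentials come from the environment
filename = 'gs://my-bucket/path/to/file.parquet.snappy'   # illustrative path
table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)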

I don't see a way to specify 'filesystem' on the ParquetFile API
<https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
Is there any way to read a GCS file using ParquetFile?

If not, can you show me the code for reading batches using pq.read_table or
one of the other Arrow Parquet APIs
<https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?

Thanks -

-- Cindy

Re: ParquetFile API and GCS file

Posted by Cindy McMullen <cm...@twitter.com>.
Found a good example
<https://stackoverflow.com/questions/68048816/how-can-i-process-a-large-parquet-file-from-spark-in-numpy-pandas>
on StackOverflow:

batches = pq_file.iter_batches(batch_size, use_pandas_metadata=True)
# batches will be a generator
for batch in batches:
    df = batch.to_pandas()
    process(df)
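Putting that together with a GCS file handle like the one elsewhere in this thread, a sketch (assumes gcsfs and pyarrow are installed; the path, batch size, and process() are illustrative):

import gcsfs
import pyarrow.parquet as pq

gs = gcsfs.GCSFileSystem()
with gs.open('gs://my-bucket/data.parquet.snappy') as f:   # illustrative path
    pq_file = pq.ParquetFile(f)
    for batch in pq_file.iter_batches(batch_size=100_000, use_pandas_metadata=True):
        df = batch.to_pandas()   # each batch is a pyarrow RecordBatch
        process(df)              # process() is user-defined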



Re: ParquetFile API and GCS file

Posted by Cindy McMullen <cm...@twitter.com>.
Can you give an example of using the ParquetFile.iter_batches() API?  I can
see it returns a 'generator' class, but I am not sure how to iterate over the
results to get at the underlying row data.


Re: ParquetFile API and GCS file

Posted by David Li <li...@apache.org>.
Ah, I'm sorry, I misremembered. I was recalling the implementation of ReadOneRowGroup/ReadRowGroups, but iter_batches() boils down to GetRecordBatchReader, which does read at a finer granularity.

-David
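As a sketch of the two granularities mentioned above (assuming pqf is a ParquetFile opened as elsewhere in this thread):

# Coarser: read_row_group() materializes one whole row group as a Table
for i in range(pqf.num_row_groups):
    rg_table = pqf.read_row_group(i)

# Finer: iter_batches() yields RecordBatches of at most batch_size rows
for batch in pqf.iter_batches(batch_size=10_000):
    ...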


Re: ParquetFile API and GCS file

Posted by Micah Kornfield <em...@gmail.com>.
>
> Just a shot in the dark, but how many row groups are there in that 1 GB
> file? IIRC, the reader loads an entire row group's worth of rows at once.


Can you clarify what you mean by "loads"? I thought it only loaded the
compressed data at once, and then read it per page (I could be misremembering,
or thinking this was an aspirational goal).
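For reference, the compressed and uncompressed sizes of each column chunk in a row group can be read from the file metadata, as a sketch (pqf as elsewhere in this thread):

rg = pqf.metadata.row_group(0)
for j in range(rg.num_columns):
    col = rg.column(j)   # ColumnChunkMetaData
    print(col.path_in_schema, col.total_compressed_size, col.total_uncompressed_size)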


Re: ParquetFile API and GCS file

Posted by Partha Dutta <pa...@gmail.com>.
I see 5 row groups. This Parquet file contains 1.8 million records.


Re: ParquetFile API and GCS file

Posted by David Li <li...@apache.org>.
Just a shot in the dark, but how many row groups are there in that 1 GB file? IIRC, the reader loads an entire row group's worth of rows at once.

-David
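A quick way to check, as a sketch (assuming pqf is the ParquetFile opened as elsewhere in this thread):

md = pqf.metadata   # pyarrow.parquet.FileMetaData
print(md.num_row_groups, md.num_rows)
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)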


Re: ParquetFile API and GCS file

Posted by Partha Dutta <pa...@gmail.com>.
I have a similar issue when trying to read huge 1 GB Parquet files from
Azure Data Lake Storage. I'm trying to read the file in small chunks using
the ParquetFile.iter_batches method, but it seems like the entire file is
read into memory before the first batch is returned. I am using the Azure
SDK for Python and another Python package (pyarrowfs-adlgen2). Has anyone
faced a problem similar to what I am seeing, or is there a workaround?
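For reference, the kind of call being described, restricted to specific row groups and columns so that less has to be fetched per read; a sketch (pf, the column names, and the batch size are illustrative):

for rg in range(pf.num_row_groups):
    for batch in pf.iter_batches(batch_size=50_000,
                                 row_groups=[rg],
                                 columns=['user_id', 'score']):
        handle(batch)   # handle() is user-defined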


-- 
Partha Dutta
partha.dutta@gmail.com

Re: ParquetFile API and GCS file

Posted by Cindy McMullen <cm...@twitter.com>.
Thanks, Arthur, this helps.  The complete code example is:

import gcsfs
import pyarrow.parquet as pq

filename = 'gs://' + files[0]   # files: a list of object paths in the bucket
gs = gcsfs.GCSFileSystem()
f = gs.open(filename)
pqf = pq.ParquetFile(f)
pqf.metadata                    # FileMetaData for the remote file



Re: ParquetFile API and GCS file

Posted by Arthur Andres <ar...@gmail.com>.
Hi Cindy,

In your case you'd have to pass a GCS file instance to the ParquetFile
constructor. Something like this:

source = fs.open_input_file(filename)   # fs: an Arrow (pyarrow.fs) FileSystem instance
parquet_file = pq.ParquetFile(source)

You can see how read_table does this in the source code:
https://github.com/apache/arrow/blob/16c442a03e2cf9c7748f0fa67b6694dbeb287fad/python/pyarrow/parquet.py#L1977

I hope this helps.
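A sketch of one way to build such a file instance for GCS, assuming the gcsfs package and wrapping it with pyarrow.fs.FSSpecHandler (the path is illustrative):

import gcsfs
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Wrap the fsspec GCS filesystem in an Arrow FileSystem
fs = pafs.PyFileSystem(pafs.FSSpecHandler(gcsfs.GCSFileSystem()))

source = fs.open_input_file('my-bucket/path/to/file.parquet.snappy')  # path form may vary
parquet_file = pq.ParquetFile(source)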


