Posted to user@arrow.apache.org by Ryan Kuhns <rn...@gmail.com> on 2022/11/09 18:14:59 UTC

Questions about working with large multi-file zipped CSV data

Hi Everyone,

I’m using pyarrow to read, process, store and analyze some large files (~460GB zipped, across 400+ files, updated quarterly).

I’ve had a couple of thoughts/questions come up as I’ve worked through the process. The first two questions are mainly informational; I want to confirm what I’ve inferred from the existing docs.

1. I know pyarrow has functionality to uncompress a zipped file with a single CSV in it, but in my case I have 3 files in the zip. I’m currently using Python’s zipfile to find and open the file I want inside the zip, and then I read it with pyarrow.csv.read_csv (first sketch below, after question 3). I wanted to confirm there isn’t pyarrow functionality that can list the files in the zip and let me select the one to uncompress and read.

2. Some of the files end up being larger than memory when unzipped. In that case I use the file size to switch over to open_csv instead of read_csv (also shown in the first sketch below). Is there any plan for open_csv to be multithreaded in a future release? (I didn’t see anything on Jira, but I’m not great at searching it.)

3. My data has lots of dimension columns (low cardinality, with longish string values) and a large number of rows. Since some files get close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these columns to dictionaries via ConvertOptions helps with the in-memory size, but then I get errors when I try to join tables together later (because dictionary unification isn’t implemented yet). Is that something that will be added? How about the ability to provide a user-supplied dictionary to use for the encoding (as an optional param, falling back to the current behavior when not provided)? That seems like it would remove the need to infer the dictionary from the data while encoding, and it would let me ensure the same dictionary mapping is used for a column across every file I read in, which I don’t think I can guarantee currently. A related feature that would solve my issue would be a way to easily map a column’s values to other values on read. I’d imagine this living in ConvertOptions, where you could specify a column and the mapping to use (a parameter accepting a list of (name, mapping) tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping, which would be more space efficient and would also avoid the join-on-dictionary-columns limitation I’m hitting. The second sketch below shows what I do today and the manual version of the mapping I have in mind.
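
Here is the first sketch: roughly what I’m doing today for questions 1 and 2. The archive path, member name, memory budget, and the handle_batch callback are all placeholders, not my real values.

import zipfile

import pyarrow.csv as pv_csv

MEM_BUDGET = 32 * 1024**3  # illustrative memory budget, not my actual number


def process_zipped_csv(zip_path, member_name, handle_batch):
    """Read one CSV member of a zip and feed its record batches to handle_batch."""
    with zipfile.ZipFile(zip_path) as zf:
        print(zf.namelist())  # today, zipfile (not pyarrow) lists the members for me
        uncompressed_size = zf.getinfo(member_name).file_size
        with zf.open(member_name) as f:
            if uncompressed_size < MEM_BUDGET:
                # small enough: read_csv is parallel but holds the whole table in memory
                for batch in pv_csv.read_csv(f).to_batches():
                    handle_batch(batch)
            else:
                # larger than memory: open_csv streams, but is currently single-threaded
                for batch in pv_csv.open_csv(f):
                    handle_batch(batch)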
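
And the second sketch: what I do today for question 3 (dictionary encoding via ConvertOptions), plus a manual post-read version of the mapping I wish I could do on read. The column name, file name, and category values are made up for illustration.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pv_csv

# What I do today: dictionary-encode a low-cardinality string column on read.
convert_opts = pv_csv.ConvertOptions(
    column_types={"loan_purpose": pa.dictionary(pa.int32(), pa.string())}
)
table = pv_csv.read_csv("2022q3.csv", convert_options=convert_opts)

# Manual workaround for the join problem: derive a stable int16 code column
# from a fixed value list shared across every file, then drop the strings.
purpose_values = pa.array(["PURCHASE", "REFINANCE", "OTHER"])  # fixed order
codes = pc.index_in(table["loan_purpose"].cast(pa.string()),
                    value_set=purpose_values).cast(pa.int16())
idx = table.schema.get_field_index("loan_purpose")
table = table.remove_column(idx).append_column("loan_purpose_code", codes)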

Thanks,

Ryan


Re: Questions about working with large multi-file zipped CSV data

Posted by Ryan Kuhns <rn...@gmail.com>.
Hi Weston,

Thanks for the information. That will help me make an informed decision.

I wish I was better with C++ so I could help parallelize the CSV parsing.

It could be useful to have some additional examples in the documentation on workflow patterns. If there is interest, that is something I would be up for contributing (assuming I can bounce some of these questions off everyone with more experience).

Thanks again,

Ryan

> On Nov 9, 2022, at 7:55 PM, Weston Pace <we...@gmail.com> wrote:
> 
> 
>> 
>> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?
> 
> There is a PR in progress[1] which will add parallel reads to
> `open_csv` when provided with a random access file (e.g. not just an
> input stream).  This is important when reading from S3 but perhaps not
> as big of a deal when reading from a local disk (which usually doesn't
> support a ton of parallelism).  The streaming CSV reader's parsing is
> also not very parallel and could be improved (I would presume to get
> pretty close to read_csv performance).  However, I don't know anyone
> currently working on this.
> 
>> If I go with this approach, will the dataset to batches read be single-threaded (reading csv format) like open_csv? That is obviously not an issue for large files I would have had to use open_csv for anyway, but if the eventual dataset API read is single threaded, I might still want to use read_csv and process columns post read for smaller datasets.
> 
> dataset.to_batches is built on top of the streaming CSV reader (e.g.
> open_csv).  However, any compute work done by datasets (e.g.
> dictionary encoding, joins, casts, etc.) will be done in parallel.
> 
> [1] https://github.com/apache/arrow/pull/14269
> 
>> On Wed, Nov 9, 2022 at 3:21 PM Ryan Kuhns <rn...@gmail.com> wrote:
>> 
>> Hi Everyone,
>> 
>> Adam’s reply got me thinking about using the dataset API to overcome the problem I was facing in my third question. It seems like I could use the column projection to provide a mapping from strings to integer lookup values. Then similar to the writing large amounts of data example (https://arrow.apache.org/docs/python/dataset.html) I can pass the dataset to write_dataset and never have everything in memory.
>> 
>> If I go with this approach, will the dataset to batches read be single-threaded (reading csv format) like open_csv? That is obviously not an issue for large files I would have had to use open_csv for anyway, but if the eventual dataset API read is single threaded, I might still want to use read_csv and process columns post read for smaller datasets.
>> 
>> Thanks,
>> 
>> Ryan
>> 
>> On Nov 9, 2022, at 4:07 PM, Ryan Kuhns <rn...@gmail.com> wrote:
>> 
>> 
>> Adam,
>> 
>> Thanks for pointing me to that. The fsspec approach looks like it will be helpful and the code snippet gives me a good starting point.
>> 
>> -Ryan
>> 
>> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <ak...@gmail.com> wrote:
>> 
>> 
>> Hi Ryan,
>> 
>> For your first question of a ZIP of multiple CSVs, I've had good luck [2] combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. fsspec allows you to manage how much RAM you use on the read side with a few different cache configs.
>> 
>> In case helpful, I sent a python snippet earlier. [3]
>> 
>> [1] https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
>> 
>> [2] The idea was proposed by emkornfield@gmail.com on this list and proved very helpful.
>> 
>> [3] https://www.mail-archive.com/user@arrow.apache.org/msg02176.html
>> 
>> 
>>> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <rn...@gmail.com> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I’m using pyarrow to read, process, store and analyze some large files (~460GB zipped on 400+ files updated quarterly).
>>> 
>>> I’ve had a couple of thoughts/questions come up as I have worked through the process. First two questions are mainly informational, wanting to confirm what I’ve inferred from existing docs.
>>> 
>>> 1. I know pyarrow has functionality to uncompress a zipped file with a single CSV in it, but in my case I have 3 files in the zip. I’m currently using Python’s zipfile to find and open the file I want in the zip and then I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t pyarrow functionality that might be able to tell me the files in the zip and let me select the one to unzip and read.
>>> 
>>> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?
>>> 
>>> 3. My data has lots of columns that are dimensions (with low cardinality) with longish string values and a large number of rows. Since I have files getting close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these to dictionaries via ConvertOptions helps with the in-memory size. But then I get errors when looking to join tables together later (due to functionality to unify dictionaries not being implemented yet). Is that something that will be added? How about the ability to provide a user dictionary that should be used in the encoding (as optional param, fallback to current functionality when not provided). Seems like that would reduce the need to infer the dictionary from the data when encoding. It would be nice to ensure the same dictionary mapping is used for a column across each file I read in. It seems like I can’t guarantee that currently. A related feature that would solve my issue would be a way to easily map a column’s values to other values on read. I’d imagine this would be something in ConvertOptions, where you could specify a column and the mapping to use (parameter accepting list of name, mapping tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping. This would be more space efficient and also avoid the inability to join on dictionary columns I am seeing currently.
>>> 
>>> Thanks,
>>> 
>>> Ryan
>>> 

Re: Questions about working with large multi-file zipped CSV data

Posted by Weston Pace <we...@gmail.com>.
> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?

There is a PR in progress[1] which will add parallel reads to
`open_csv` when provided with a random access file (e.g. not just an
input stream).  This is important when reading from S3 but perhaps not
as big of a deal when reading from a local disk (which usually doesn't
support a ton of parallelism).  The streaming CSV reader's parsing is
also not very parallel and could be improved (I would presume it could
get pretty close to read_csv performance).  However, I don't know of
anyone currently working on this.

> If I go with this approach, will the dataset to batches read be single-threaded (reading csv format) like open_csv? That is obviously not an issue for large files I would have had to use open_csv for anyway, but if the eventual dataset API read is single threaded, I might still want to use read_csv and process columns post read for smaller datasets.

dataset.to_batches is built on top of the streaming CSV reader (e.g.
open_csv).  However, any compute work done by datasets (e.g.
dictionary encoding, joins, casts, etc.) will be done in parallel.

[1] https://github.com/apache/arrow/pull/14269
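
For reference, the pattern being discussed looks roughly like this (the
directory path is just a placeholder):

import pyarrow.dataset as ds

# Each CSV file is scanned with the streaming reader (open_csv under the
# hood), while compute done by the dataset layer (casts, projections,
# dictionary encoding, joins) runs in parallel.
dataset = ds.dataset("extracted_csvs/", format="csv")
for batch in dataset.to_batches():
    print(batch.num_rows)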

On Wed, Nov 9, 2022 at 3:21 PM Ryan Kuhns <rn...@gmail.com> wrote:
>
> Hi Everyone,
>
> Adam’s reply got me thinking about using the dataset API to overcome the problem I was facing in my third question. It seems like I could use the column projection to provide a mapping from strings to integer lookup values. Then similar to the writing large amounts of data example (https://arrow.apache.org/docs/python/dataset.html) I can pass the dataset to write_dataset and never have everything in memory.
>
> If I go with this approach, will the dataset to batches read be single-threaded (reading csv format) like open_csv? That is obviously not an issue for large files I would have had to use open_csv for anyway, but if the eventual dataset API read is single threaded, I might still want to use read_csv and process columns post read for smaller datasets.
>
> Thanks,
>
> Ryan
>
> On Nov 9, 2022, at 4:07 PM, Ryan Kuhns <rn...@gmail.com> wrote:
>
> 
> Adam,
>
> Thanks for pointing me to that. The fsspec approach looks like it will be helpful and the code snippet gives me a good starting point.
>
> -Ryan
>
> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <ak...@gmail.com> wrote:
>
> 
> Hi Ryan,
>
> For your first question of a ZIP of multiple CSVs, I've had good luck [2] combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. fsspec allows you to manage how much RAM you use on the read side with a few different cache configs.
>
> In case helpful, I sent a python snippet earlier. [3]
>
> [1] https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
>
> [2] The idea was proposed by emkornfield@gmail.com on this list and proved very helpful.
>
> [3] https://www.mail-archive.com/user@arrow.apache.org/msg02176.html
>
>
> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <rn...@gmail.com> wrote:
>>
>> Hi Everyone,
>>
>> I’m using pyarrow to read, process, store and analyze some large files (~460GB zipped on 400+ files updated quarterly).
>>
>> I’ve had a couple of thoughts/questions come up as I have worked through the process. First two questions are mainly informational, wanting to confirm what I’ve inferred from existing docs.
>>
>> 1. I know pyarrow has functionality to uncompress a zipped file with a single CSV in it, but in my case I have 3 files in the zip. I’m currently using Python’s zipfile to find and open the file I want in the zip and then I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t pyarrow functionality that might be able to tell me the files in the zip and let me select the one to unzip and read.
>>
>> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?
>>
>> 3. My data has lots of columns that are dimensions (with low cardinality) with longish string values and a large number of rows. Since I have files getting close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these to dictionaries via ConvertOptions helps with the in-memory size. But then I get errors when looking to join tables together later (due to functionality to unify dictionaries not being implemented yet). Is that something that will be added? How about the ability to provide a user dictionary that should be used in the encoding (as optional param, fallback to current functionality when not provided). Seems like that would reduce the need to infer the dictionary from the data when encoding. It would be nice to ensure the same dictionary mapping is used for a column across each file I read in. It seems like I can’t guarantee that currently. A related feature that would solve my issue would be a way to easily map a column’s values to other values on read. I’d imagine this would be something in ConvertOptions, where you could specify a column and the mapping to use (parameter accepting list of name, mapping tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping. This would be more space efficient and also avoid the inability to join on dictionary columns I am seeing currently.
>>
>> Thanks,
>>
>> Ryan
>>

Re: Questions about working with large multi-file zipped CSV data

Posted by Ryan Kuhns <rn...@gmail.com>.
Hi Everyone,

Adam’s reply got me thinking about using the dataset API to overcome the problem I was facing in my third question. It seems like I could use the column projection to provide a mapping from strings to integer lookup values. Then, similar to the writing-large-amounts-of-data example (https://arrow.apache.org/docs/python/dataset.html), I can pass the dataset to write_dataset and never have everything in memory.

If I go with this approach, will the dataset-to-batches read be single-threaded (for the CSV format) like open_csv? That is obviously not an issue for the large files I would have had to use open_csv for anyway, but if the eventual dataset API read is single-threaded, I might still want to use read_csv and process columns post-read for smaller datasets.
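
To make the idea concrete, here is a rough sketch of what I have in mind. The column name, value list, and paths are placeholders, and the mapping could presumably live in a scanner projection instead of this generator.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

src = ds.dataset("extracted_csvs/", format="csv")
purpose_values = pa.array(["PURCHASE", "REFINANCE", "OTHER"])  # fixed value list


def mapped_batches(dataset):
    # Stream batches, replacing one string column with a stable int16 code.
    for batch in dataset.to_batches():
        names = batch.schema.names
        drop = names.index("loan_purpose")
        codes = pc.index_in(batch.column(drop),
                            value_set=purpose_values).cast(pa.int16())
        arrays = [batch.column(i) for i in range(batch.num_columns) if i != drop]
        yield pa.RecordBatch.from_arrays(
            arrays + [codes],
            names=[n for n in names if n != "loan_purpose"] + ["loan_purpose_code"])


out_schema = pa.schema(
    [f for f in src.schema if f.name != "loan_purpose"]
    + [pa.field("loan_purpose_code", pa.int16())])

# write_dataset consumes the batches as a stream, so the full mapped data
# should never have to sit in memory at once.
ds.write_dataset(mapped_batches(src), "mapped_parquet/",
                 format="parquet", schema=out_schema)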

Thanks,

Ryan

> On Nov 9, 2022, at 4:07 PM, Ryan Kuhns <rn...@gmail.com> wrote:
> 
> 
> Adam,
> 
> Thanks for pointing me to that. The fsspec approach looks like it will be helpful and the code snippet gives me a good starting point.
> 
> -Ryan
> 
>>> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <ak...@gmail.com> wrote:
>>> 
>> 
>> Hi Ryan,
>> 
>> For your first question of a ZIP of multiple CSVs, I've had good luck [2] combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. fsspec allows you to manage how much RAM you use on the read side with a few different cache configs.
>> 
>> In case helpful, I sent a python snippet earlier. [3]
>> 
>> [1] https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
>> 
>> [2] The idea was proposed by emkornfield@gmail.com on this list and proved very helpful. 
>> 
>> [3] https://www.mail-archive.com/user@arrow.apache.org/msg02176.html
>> 
>> 
>>> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <rn...@gmail.com> wrote:
>>> Hi Everyone,
>>> 
>>> I’m using pyarrow to read, process, store and analyze some large files (~460GB zipped on 400+ files updated quarterly).
>>> 
>>> I’ve had a couple of thoughts/questions come up as I have worked through the process. First two questions are mainly informational, wanting to confirm what I’ve inferred from existing docs.
>>> 
>>> 1. I know pyarrow has functionality to uncompress a zipped file with a single CSV in it, but in my case I have 3 files in the zip. I’m currently using Python’s zipfile to find and open the file I want in the zip and then I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t pyarrow functionality that might be able to tell me the files in the zip and let me select the one to unzip and read.
>>> 
>>> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?
>>> 
>>> 3. My data has lots of columns that are dimensions (with low cardinality) with longish string values and a large number of rows. Since I have files getting close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these to dictionaries via ConvertOptions helps with the in-memory size. But then I get errors when looking to join tables together later (due to functionality to unify dictionaries not being implemented yet). Is that something that will be added? How about the ability to provide a user dictionary that should be used in the encoding (as optional param, fallback to current functionality when not provided). Seems like that would reduce the need to infer the dictionary from the data when encoding. It would be nice to ensure the same dictionary mapping is used for a column across each file I read in. It seems like I can’t guarantee that currently. A related feature that would solve my issue would be a way to easily map a column’s values to other values on read. I’d imagine this would be something in ConvertOptions, where you could specify a column and the mapping to use (parameter accepting list of name, mapping tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping. This would be more space efficient and also avoid the inability to join on dictionary columns I am seeing currently.
>>> 
>>> Thanks,
>>> 
>>> Ryan
>>> 

Re: Questions about working with large multi-file zipped CSV data

Posted by Ryan Kuhns <rn...@gmail.com>.
Adam,

Thanks for pointing me to that. The fsspec approach looks like it will be helpful and the code snippet gives me a good starting point.

-Ryan

> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <ak...@gmail.com> wrote:
> 
> 
> Hi Ryan,
> 
> For your first question of a ZIP of multiple CSVs, I've had good luck [2] combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. fsspec allows you to manage how much RAM you use on the read side with a few different cache configs.
> 
> In case helpful, I sent a python snippet earlier. [3]
> 
> [1] https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
> 
> [2] The idea was proposed by emkornfield@gmail.com on this list and proved very helpful. 
> 
> [3] https://www.mail-archive.com/user@arrow.apache.org/msg02176.html
> 
> 
>> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <rn...@gmail.com> wrote:
>> Hi Everyone,
>> 
>> I’m using pyarrow to read, process, store and analyze some large files (~460GB zipped on 400+ files updated quarterly).
>> 
>> I’ve had a couple of thoughts/questions come up as I have worked through the process. First two questions are mainly informational, wanting to confirm what I’ve inferred from existing docs.
>> 
>> 1. I know pyarrow has functionality to uncompress a zipped file with a single CSV in it, but in my case I have 3 files in the zip. I’m currently using Python’s zipfile to find and open the file I want in the zip and then I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t pyarrow functionality that might be able to tell me the files in the zip and let me select the one to unzip and read.
>> 
>> 2. Some of the files end up being larger than memory when unzipped. In this case I’m using the file size to switch over and use open_csv instead of read_csv. Is there any plan for open_csv to be multithreaded in a future release (didn’t see anything on Jira, but I’m not great at searching on it)?
>> 
>> 3. My data has lots of columns that are dimensions (with low cardinality) with longish string values and a large number of rows. Since I have files getting close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these to dictionaries via ConvertOptions helps with the in-memory size. But then I get errors when looking to join tables together later (due to functionality to unify dictionaries not being implemented yet). Is that something that will be added? How about the ability to provide a user dictionary that should be used in the encoding (as optional param, fallback to current functionality when not provided). Seems like that would reduce the need to infer the dictionary from the data when encoding. It would be nice to ensure the same dictionary mapping is used for a column across each file I read in. It seems like I can’t guarantee that currently. A related feature that would solve my issue would be a way to easily map a column’s values to other values on read. I’d imagine this would be something in ConvertOptions, where you could specify a column and the mapping to use (parameter accepting list of name, mapping tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping. This would be more space efficient and also avoid the inability to join on dictionary columns I am seeing currently.
>> 
>> Thanks,
>> 
>> Ryan
>> 

Re: Questions about working with large multi-file zipped CSV data

Posted by "Kirby, Adam" <ak...@gmail.com>.
Hi Ryan,

For your first question about a ZIP containing multiple CSVs, I've had good
luck [2] combining fsspec [1] with a pyarrow dataset to process them.
fsspec allows you to manage how much RAM you use on the read side with a
few different cache configs.

In case it's helpful, I sent a Python snippet earlier. [3]

[1]
https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html

[2] The idea was proposed by emkornfield@gmail.com on this list and proved
very helpful.

[3] https://www.mail-archive.com/user@arrow.apache.org/msg02176.html
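
In rough form, the pattern looks something like this (the archive path and
member name are placeholders; the snippet in [3] is more complete):

import fsspec
import pyarrow.dataset as ds

# Expose the zip archive as a filesystem; fsspec's caching options control
# how much of the file is buffered in RAM on the read side.
zip_fs = fsspec.filesystem("zip", fo="data/quarterly_archive.zip")
print(zip_fs.ls("/"))  # see which CSV members the archive contains

dataset = ds.dataset("loans.csv", filesystem=zip_fs, format="csv")
for batch in dataset.to_batches():
    print(batch.num_rows)  # process incrementally, no extraction to disk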


On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <rn...@gmail.com> wrote:

> Hi Everyone,
>
> I’m using pyarrow to read, process, store and analyze some large files
> (~460GB zipped on 400+ files updated quarterly).
>
> I’ve had a couple of thoughts/questions come up as I have worked through
> the process. First two questions are mainly informational, wanting to
> confirm what I’ve inferred from existing docs.
>
> 1. I know pyarrow has functionality to uncompress a zipped file with a
> single CSV in it, but in my case I have 3 files in the zip. I’m currently
> using Python’s zipfile to find and open the file I want in the zip and then
> I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t
> pyarrow functionality that might be able to tell me the files in the zip
> and let me select the one to unzip and read.
>
> 2. Some of the files end up being larger than memory when unzipped. In
> this case I’m using the file size to switch over and use open_csv instead
> of read_csv. Is there any plan for open_csv to be multithreaded in a future
> release (didn’t see anything on Jira, but I’m not great at searching on it)?
>
> 3. My data has lots of columns that are dimensions (with low cardinality)
> with longish string values and a large number of rows. Since I have files
> getting close to or above my available memory when unzipped, I need to be
> as memory efficient as possible. Converting these to dictionaries via
> ConvertOptions helps with the in-memory size. But then I get errors when
> looking to join tables together later (due to functionality to unify
> dictionaries not being implemented yet). Is that something that will be
> added? How about the ability to provide a user dictionary that should be
> used in the encoding (as optional param, fallback to current functionality
> when not provided). Seems like that would reduce the need to infer the
> dictionary from the data when encoding. It would be nice to ensure the same
> dictionary mapping is used for a column across each file I read in. It
> seems like I can’t guarantee that currently. A related feature that would
> solve my issue would be a way to easily map a column’s values to other
> values on read. I’d imagine this would be something in ConvertOptions,
> where you could specify a column and the mapping to use (parameter
> accepting list of name, mapping tuples?). The end result would be the
> ability to convert a string column to something like int16 on read via the
> mapping. This would be more space efficient and also avoid the inability to
> join on dictionary columns I am seeing currently.
>
> Thanks,
>
> Ryan
>
>