Posted to jira@arrow.apache.org by "Damien Ready (Jira)" <ji...@apache.org> on 2021/11/20 07:36:00 UTC

[jira] [Created] (ARROW-14781) Improved Tooling/Documentation on Constructing Larger than Memory Parquet

Damien Ready created ARROW-14781:
------------------------------------

             Summary: Improved Tooling/Documentation on Constructing Larger than Memory Parquet
                 Key: ARROW-14781
                 URL: https://issues.apache.org/jira/browse/ARROW-14781
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Damien Ready


I have ~800 GB of CSVs spread across ~1200 files and a mere 32 GB of RAM. My objective is to incrementally build a Parquet dataset holding the collection; I can only ever hold a small subset of the data in memory.

Following the docs as best I could, I was able to hack together a workflow that does what I need, but it seems overly complex. I hope this problem is not out of scope, and I would love to see an effort towards:

1) streamlining the APIs to make this workflow more straightforward
2) better documentation on how to approach this problem
3) out-of-the-box CLI utilities that would do this without any effort on my part

Expanding on 3), I was imagining something like `parquet-cat`, `parquet-append`, `parquet-sample`, `parquet-metadata`, or similar utilities that would allow interacting with these files from the terminal. As it is, they are just blobs that require additional tooling to get even the barest sense of what is within.
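
To illustrate, here is roughly what I would expect a `parquet-metadata`/`parquet-sample` wrapper to do, sketched with `pyarrow.parquet` (the script name and file path are just placeholders, and I have not thought about output formatting):

```python
# minimal sketch of an inspection script, e.g. `python parquet_inspect.py some_file.parquet`
import sys

import pyarrow.parquet as pq

path = sys.argv[1]
pf = pq.ParquetFile(path)

print(pf.schema_arrow)           # column names and Arrow types
print(pf.metadata)               # number of rows, row groups, created_by, ...
print(pf.metadata.row_group(0))  # per-row-group details

# a small "head"-style peek without reading the whole file
print(next(pf.iter_batches(batch_size=5)).to_pandas())
```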

A reproducible example is below (given a directory of CSVs). I am happy to hear what I missed that would have made this more straightforward, or that would also generate the Parquet metadata at the same time.

```python
import itertools
import os

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

def gen_batches(root_directory):
    for fname in os.listdir(root_directory):
        dataf = pd.read_csv(os.path.join(root_directory, fname))
        # insert data munging here (ie I need pandas in the workflow)

        # the dataset writer only consumes an iterable of record batches, not Tables
        yield from pa.Table.from_pandas(dataf).to_batches()


batches = gen_batches("src_csv_data_dir/")

# using write_dataset requires providing the schema, which is not accessible from a batch?
peek_batch = next(batches)
# needed to build a table to get at the schema
schema = pa.Table.from_batches([peek_batch]).schema

# consumed the first entry of the generator, rebuild it here
renew_gen_batches = itertools.chain([peek_batch], batches)

ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet", format="parquet", schema=schema)
# attempting write_dataset with an iterable of Tables threw: pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object of type 'pyarrow.lib.Table'
```
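
On the metadata question, the closest I have found is the `file_visitor` callback on `ds.write_dataset` combined with `pq.write_metadata`. The sketch below is untested and assumes a recent pyarrow where the visitor receives the written file's path and Parquet metadata; it would replace the `ds.write_dataset` call in the example above:

```python
# untested sketch: collect per-file parquet metadata while writing, then emit a
# _metadata sidecar next to the data files
import pyarrow.parquet as pq

collected = []

def visit(written_file):
    # written_file.path / written_file.metadata come from the file_visitor hook;
    # paths likely need to be made relative to the dataset root before writing _metadata
    written_file.metadata.set_file_path(written_file.path)
    collected.append(written_file.metadata)

ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet",
                 format="parquet", schema=schema, file_visitor=visit)

pq.write_metadata(schema, "parquet_dst.parquet/_metadata",
                  metadata_collector=collected)
```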
