Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/23 10:23:00 UTC

[jira] [Commented] (ARROW-14781) Improved Tooling/Documentation on Constructing Larger than Memory Parquet

    [ https://issues.apache.org/jira/browse/ARROW-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447879#comment-17447879 ] 

Joris Van den Bossche commented on ARROW-14781:
-----------------------------------------------

[~ludicrous_speed] A question about your example: you have a generator that produces arrow batches. I assume that in your real use case this generator yields batches from reading the individual csv files?
In case you don't know: {{ds.dataset(..)}} also supports reading csv files directly, which in principle should allow you to write them to parquet as:

{code}
csv_dataset = ds.dataset("...", format="csv")
ds.write_dataset(csv_dataset, "parquet_dst.parquet", format="parquet")
{code}
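
Separately from the csv route, a small note on the workaround in your example below: a {{RecordBatch}} already exposes its schema via the {{.schema}} attribute, so the intermediate {{pa.Table.from_batches(..)}} step isn't needed just to get the schema. A minimal sketch of that variant (the small random-data generator here is just a stand-in for your real csv-reading generator):

{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds


def gen_batches():
    # stand-in for the real generator that reads the individual csv files
    for _ in range(3):
        dataf = pd.DataFrame(np.random.randint(0, 100, size=(25, 5)), columns=list("abcde"))
        for batch in pa.Table.from_pandas(dataf).to_batches():
            yield batch


batches = gen_batches()
peek_batch = next(batches)
schema = peek_batch.schema  # the schema is available directly on the RecordBatch
# put the peeked batch back in front and stream everything to parquet
ds.write_dataset(itertools.chain([peek_batch], batches),
                 base_dir="parquet_dst.parquet", format="parquet", schema=schema)
{code}

In both cases the batches are consumed lazily, so in principle only the batches currently being processed need to fit in memory.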

> Improved Tooling/Documentation on Constructing Larger than Memory Parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-14781
>                 URL: https://issues.apache.org/jira/browse/ARROW-14781
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Damien Ready
>            Priority: Minor
>
> I have ~800GBs of csvs distributed across ~1200 files and a mere 32GB of RAM. My objective is to incrementally build a parquet dataset holding the collection. I can only hold a small subset of the data in memory.
> Following the docs as best I could, I was able to hack together a workflow that does what I need, but it seems overly complex. I hope my problem is not out of scope, and I would love it if there were an effort to:
> 1) streamline the APIs to make this more straightforward
> 2) provide better documentation on how to approach this problem
> 3) provide out-of-the-box CLI utilities that would do this without any effort on my part
> Expanding on 3), I was imagining something like a `parquet-cat`, `parquet-append`, `parquet-sample`, `parquet-metadata` or similar that would allow interacting with these files from the terminal. As it is, they are just blobs that require additional tooling to get even the barest sense of what is within.
> Reproducible example below. Happy to hear what I missed that would have made this more straightforward, or that would also generate the parquet metadata at the same time.
> EDIT: made the example generate random dataframes so it can be run directly. The original was too close to my use case, where I was reading files from disk.
> {code:python}
> import itertools
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> def gen_batches():
>     NUM_CSV_FILES = 15
>     NUM_ROWS = 25
>     for _ in range(NUM_CSV_FILES):
>         dataf = pd.DataFrame(np.random.randint(0, 100, size=(NUM_ROWS, 5)), columns=list("abcde"))
>         # ds.write_dataset only consumes an iterable of record batches (not tables)
>         for batch in pa.Table.from_pandas(dataf).to_batches():
>             yield batch
>
> batches = gen_batches()
> # using the write_dataset method requires providing the schema, which is not accessible from a batch?
> peek_batch = next(batches)
> # needed to build a table to get to the schema
> schema = pa.Table.from_batches([peek_batch]).schema
> # consumed the first entry of the generator, rebuild it here
> renew_gen_batches = itertools.chain([peek_batch], batches)
> ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet", format="parquet", schema=schema)
> # attempting write_dataset with an iterable of Tables threw: pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object of type 'pyarrow.lib.Table'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)