Posted to user@arrow.apache.org by Lei Xu <le...@eto.ai> on 2022/12/09 23:09:04 UTC
[Python] pyarrow.dataset.dataset(source) does not take RecordBatchReader as its document states.
Hello,

I was trying to write a larger-than-memory dataset via
pyarrow.dataset.write_dataset().

I attempted to create a PyArrow dataset from a RecordBatchReader, or from a
generator of RecordBatch, so that batches could be written one by one
without holding them all in memory:

    def _record_batch_gen() -> Generator[pa.RecordBatch, None, None]:
        for i in range(VERY_LARGE_NUMBER):
            yield pa.RecordBatch.from_arrays(arrs, names)

    batch_reader = pa.RecordBatchReader.from_batches(schema, _record_batch_gen())
    dataset = pa.dataset.dataset(batch_reader)
PyArrow raises an exception like:

      File "/home/lei/work/lance/python/./lance/data/convert/imagenet.py",
    line 79, in convert_imagenet_1k
        dataset = pa.dataset.dataset(batch_reader)
      File
    "/home/lei/work/lance/python/venv/lib/python3.10/site-packages/pyarrow/dataset.py",
    line 772, in dataset
        raise TypeError(
    TypeError: Expected a path-like, list of path-likes or a list of Datasets
    instead of the given type: RecordBatchReader
The PyArrow documentation suggests that `pyarrow.dataset.dataset(source)`
can take a RecordBatchReader:
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset

Would appreciate any suggestions.
Best,
--
Lei Xu