Posted to user@arrow.apache.org by Lei Xu <le...@eto.ai> on 2022/12/09 23:09:04 UTC

[Python] pyarrow.dataset.dataset(source) does not take RecordBatchReader as its document states.

> Hello,

I am trying to write a larger-than-memory dataset via
pyarrow.dataset.write_dataset().
My approach was to create a PyArrow dataset from a RecordBatchReader
backed by a generator of RecordBatches, so that batches can be written
one at a time instead of materializing everything in memory.

def _record_batch_gen() -> Generator[pa.RecordBatch, None, None]:
    for i in range(VERY_LARGE_NUMBER):
        # arrs and names are defined elsewhere; build each batch lazily.
        yield pa.RecordBatch.from_arrays(arrs, names=names)


batch_reader = pa.RecordBatchReader.from_batches(schema, _record_batch_gen())
dataset = pa.dataset.dataset(batch_reader)


PyArrow raises an exception like:

  File "/home/lei/work/lance/python/./lance/data/convert/imagenet.py", line 79, in convert_imagenet_1k
    dataset = pa.dataset.dataset(batch_reader)
  File "/home/lei/work/lance/python/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 772, in dataset
    raise TypeError(
TypeError: Expected a path-like, list of path-likes or a list of Datasets
instead of the given type: RecordBatchReader

The PyArrow documentation for `pyarrow.dataset.dataset(source)` suggests
that it can accept a RecordBatchReader as the source:

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset

Would appreciate any suggestion.

Best,
-- 
Lei Xu