Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/20 23:48:36 UTC
[GitHub] [arrow] KaixiangLin opened a new issue #12008: [Question][Python] How to create a large single chunk file without loading all tables into the memory
KaixiangLin opened a new issue #12008:
URL: https://github.com/apache/arrow/issues/12008
Hello,
We are looking for a way to create a single-chunk table because of the issue described [here](https://issues.apache.org/jira/browse/ARROW-11989). Indexing into a single-chunk table is much faster.
Currently, we write the table by first loading all files, converting them to tables, and then combining the chunks.
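```python
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

# all_datasets, schema and output_filename are defined elsewhere.
train_datasets = []
for ds_file in all_datasets:
    ds = pa.dataset.dataset(ds_file, format='feather')
    train_datasets.append(ds.to_table())

# Concatenate the tables, then merge each column into a single chunk.
combined_table = pa.concat_tables(train_datasets).combine_chunks()
table = combined_table.cast(schema)

with open(output_filename, "wb") as f:
    s = pa.ipc.new_stream(
        f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    batches = table.to_batches()
    s.write_batch(batches[0])
    s.close()  # write the end-of-stream marker
```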
However, this approach needs about twice as much memory as the original dataset. Is there a way to write the datasets one by one while still ending up with a single chunk?
Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] KaixiangLin closed issue #12008: [Question][Python] How to create a large single chunk file without loading all tables into the memory
Posted by GitBox <gi...@apache.org>.
KaixiangLin closed issue #12008:
URL: https://github.com/apache/arrow/issues/12008