Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/20 23:48:36 UTC

[GitHub] [arrow] KaixiangLin opened a new issue #12008: [Question][Python] How to create a large single chunk file without loading all tables into the memory

KaixiangLin opened a new issue #12008:
URL: https://github.com/apache/arrow/issues/12008


   Hello, 
   
   We are looking for a way to create a single-chunk table because of the issue described [here](https://issues.apache.org/jira/browse/ARROW-11989). Indexing into a single-chunk table is much faster.
   
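   Currently, we write the table by first loading all files, converting them to tables, and then combining the chunks:
   ```python
   import pyarrow as pa
   import pyarrow.dataset  # makes pa.dataset available

   # all_datasets, schema and output_filename are defined elsewhere.
   train_datasets = []
   for ds_file in all_datasets:
       ds = pa.dataset.dataset(ds_file, format="feather")
       train_datasets.append(ds.to_table())  # loads the whole file into memory

   # Concatenate, then copy each column into one contiguous chunk.
   combined_table = pa.concat_tables(train_datasets).combine_chunks()
   table = combined_table.cast(schema)

   with open(output_filename, "wb") as f:
       options = pa.ipc.IpcWriteOptions(allow_64bit=True)
       # The context manager closes the writer and emits the end-of-stream marker.
       with pa.ipc.new_stream(f, schema, options=options) as s:
           batches = table.to_batches()
           s.write_batch(batches[0])
   ```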
   However, this approach needs about twice as much memory as the original dataset. Is there a way to write the datasets one by one
   while still ending up with a single chunk?
   
   Thank you! 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] KaixiangLin closed issue #12008: [Question][Python] How to create a large single chunk file without loading all tables into the memory

Posted by GitBox <gi...@apache.org>.
KaixiangLin closed issue #12008:
URL: https://github.com/apache/arrow/issues/12008


   

