Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 11:28:11 UTC

[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #9702: ARROW-11297: [C++][Python] Add ORC writer options

jorisvandenbossche edited a comment on pull request #9702:
URL: https://github.com/apache/arrow/pull/9702#issuecomment-984540887


   Reading a bit more about it, I now realize that `stripe_size` is (I think) a size in bytes, whereas I had assumed it was a number of rows. That is an additional reason why my example above didn't work.
   So passing the stripe size as the batch size, as you did in the last commit, is incorrect, I think.
   
   I think this has the following consequences:
   
   - The file is written in batches: our `ORCFileWriter::Write` calls `orc::Writer::add` multiple times, in chunks of `kOrcWriterBatchSize` (which is expressed in number of rows). It seems that the check whether to write a stripe only happens once per added batch, after that batch has been added. So that means you can't create multiple stripes from a single batch? You can only add batches until you reach the minimum stripe size, at which point a stripe is created, and the batches added after that form the next stripe. So this seems to generally assume that batches are smaller than stripes?
   - For this reason, would it make sense to be able to specify `batch_size` as well? If you want a smaller stripe size, you might need a smaller batch size too. Although it is currently set to `128 * 1024` rows, which seems small enough for practical use?
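
To make the reasoning in the two points above concrete, here is a toy model of how I understand the stripe creation. It is only a sketch and not the actual ORC writer logic: the function names are made up, the 8 bytes per row is the uncompressed size of a float64 column (the real writer looks at its own, possibly compressed, buffers), and the 64 MB default stripe size is my assumption.

```python
# Toy model (hypothetical names, not the real ORC implementation):
# a stripe can only be flushed after a whole batch has been added.

BATCH_SIZE_ROWS = 128 * 1024            # mirrors kOrcWriterBatchSize (rows)
DEFAULT_STRIPE_SIZE = 64 * 1024 * 1024  # assumed default stripe size (bytes)


def split_into_batches(num_rows, batch_size=BATCH_SIZE_ROWS):
    """Return the row counts of the batches a table would be written in."""
    full, rest = divmod(num_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def count_stripes(num_rows, stripe_size=DEFAULT_STRIPE_SIZE,
                  bytes_per_row=8, batch_size=BATCH_SIZE_ROWS):
    """Count stripes when the size check only runs once per added batch."""
    stripes = 0
    pending_bytes = 0
    for rows in split_into_batches(num_rows, batch_size):
        pending_bytes += rows * bytes_per_row
        # The check happens only here, after a complete batch was added.
        if pending_bytes >= stripe_size:
            stripes += 1
            pending_bytes = 0
    # Closing the writer flushes whatever is still pending.
    if pending_bytes > 0:
        stripes += 1
    return stripes


print(count_stripes(128 * 1024 + 1))                     # 1
print(count_stripes(128 * 1024 + 1, stripe_size=10))     # 2
print(count_stripes(2 * 128 * 1024 + 1, stripe_size=10)) # 3
print(count_stripes(128 * 1024, stripe_size=10))         # 1
```

The last call shows the limitation: a single batch never yields more than one stripe in this model, however small the requested stripe size is.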
   
   In practice this also means that you need a test dataset larger than this default batch size to see the effect of `stripe_size` (using your branch without the last commit):
   
   ```python
   >>> import numpy as np
   >>> import pyarrow as pa
   >>> from pyarrow import orc
   # table which will be written as two batches
   >>> table = pa.table({'a': np.random.randn((128 * 1024) + 1)})
   # with the default stripe size, you still get a single stripe
   >>> orc.write_table(table, "test_orc_size.orc", compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   1
   # but with a small stripe size, you actually get two stripes as expected
   # (setting it to an arbitrarily low 10 bytes for testing)
   >>> orc.write_table(table, "test_orc_size.orc", stripe_size=10, compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   2
   # and further increasing the size of the table works as expected
   >>> table = pa.table({'a': np.random.randn((128 * 1024) * 2 + 1)})
   >>> orc.write_table(table, "test_orc_size.orc", stripe_size=10, compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   3
   ```

