Posted to dev@orc.apache.org by Eric Wohlstadter <wo...@gmail.com> on 2021/02/09 00:37:09 UTC

Limiting ORC file sizes from Java API

Hi all,
I am using the Java API to write ORC files from a custom Java process to
S3. Can I cap the compressed size of the output files at a certain limit?
For example, let's say I don't want any files to be larger than 50MB.

The flow in my code is (a minimal sketch follows below):
1. Create an ORC writer: OrcFile.createWriter
2. Add rows 1000 at a time using VectorizedRowBatch: writer.addRowBatch(batch)
3. Close the writer when no more data is available to be added: writer.close()
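
To make that concrete, here is a minimal, self-contained sketch of the
flow. The schema, S3 path, and row values are illustrative placeholders,
not my real ones:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcWriteFlow {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder schema; my real one is wider.
        TypeDescription schema =
            TypeDescription.fromString("struct<x:bigint,y:bigint>");

        // 1. Create an ORC writer.
        Writer writer = OrcFile.createWriter(
            new Path("s3a://my-bucket/data/part-00000.orc"), // placeholder
            OrcFile.writerOptions(conf).setSchema(schema));

        // 2. Add rows 1000 at a time using a VectorizedRowBatch.
        VectorizedRowBatch batch = schema.createRowBatch(1000);
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for (long row = 0; row < 100_000; row++) {
          int i = batch.size++;
          x.vector[i] = row;
          y.vector[i] = row * 2;
          if (batch.size == batch.getMaxSize()) {
            writer.addRowBatch(batch);
            batch.reset();
          }
        }
        if (batch.size > 0) {
          writer.addRowBatch(batch); // flush the final partial batch
        }

        // 3. Close the writer when no more data is available.
        writer.close();
      }
    }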

This creates one file on S3 whose size is determined by how much data was
written, plus the effect of compression.
The API doesn't seem to offer any way to determine in advance how much
data will be written before the writer is closed, so I can't use that
information to decide when to close the writer.

If I write 100MB but would like 50MB files, is there a way to configure the
OrcWriter to split into two files based on a size limit?
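
One workaround I've considered is rolling the files myself: close the
current writer and open a new one once an estimated output size is
reached. Something like this sketch, where the compression ratio is just
a guess I would have to tune per dataset, and the byte accounting is
done on my side because I don't know of a writer API that reports the
compressed bytes written so far:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class RollingOrcWriter implements AutoCloseable {
      private static final long TARGET_BYTES = 50L * 1024 * 1024; // ~50MB/file
      private static final double ASSUMED_RATIO = 0.25; // guessed compression ratio

      private final Configuration conf;
      private final TypeDescription schema;
      private final String pathPrefix; // e.g. "s3a://my-bucket/data/part-" (placeholder)
      private Writer writer;
      private long rawBytes;  // uncompressed bytes added to the current file
      private int fileIndex;

      public RollingOrcWriter(Configuration conf, TypeDescription schema,
                              String pathPrefix) throws IOException {
        this.conf = conf;
        this.schema = schema;
        this.pathPrefix = pathPrefix;
        openNextFile();
      }

      private void openNextFile() throws IOException {
        writer = OrcFile.createWriter(
            new Path(pathPrefix + fileIndex++ + ".orc"),
            OrcFile.writerOptions(conf).setSchema(schema));
        rawBytes = 0;
      }

      // rawBatchBytes is the caller's own estimate of the uncompressed
      // size of this batch.
      public void addRowBatch(VectorizedRowBatch batch, long rawBatchBytes)
          throws IOException {
        writer.addRowBatch(batch);
        rawBytes += rawBatchBytes;
        if (rawBytes * ASSUMED_RATIO >= TARGET_BYTES) {
          writer.close();  // finish this file...
          openNextFile();  // ...and start the next one
        }
      }

      @Override
      public void close() throws IOException {
        writer.close();
      }
    }

The obvious caveat is that rolling only happens on batch boundaries and
the ratio is a guess, so files would only approximately respect the 50MB
target. Is there a better mechanism built into the API?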

Followup question: when writing to S3, is each row batch flushed out of
memory when it is added, or is it buffered in memory until the writer is
closed?
My reason for wanting a limit (e.g. 50MB) is to ensure that no more than
that amount of data is buffered in Java memory (my Java process runs
multiple OrcWriters concurrently, one per input stream). If this
buffering in memory is not an issue, maybe I don't need to worry about
limiting the file size at all?
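
My (possibly wrong) assumption is that the writer buffers roughly one
stripe at a time before flushing it to the output stream, in which case
maybe I could bound per-writer memory by shrinking the stripe size when
creating the writer, e.g.:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class SmallStripeWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema =
            TypeDescription.fromString("struct<x:bigint,y:bigint>"); // placeholder

        Writer writer = OrcFile.createWriter(
            new Path("s3a://my-bucket/data/part-00000.orc"), // placeholder
            OrcFile.writerOptions(conf)
                .setSchema(schema)
                .stripeSize(16L * 1024 * 1024) // smaller stripes than the default
                .compress(CompressionKind.ZLIB));
        // ... addRowBatch(...) as usual, then:
        writer.close();
      }
    }

Would something like that actually bound the memory, or is the whole
file buffered until close when the target is S3?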

Thanks for your help,

Eric