Posted to dev@orc.apache.org by satyajit vegesna <sa...@gmail.com> on 2022/07/08 22:56:01 UTC

Can multiple small ORC files be combined, without having to write all files into a big file?

Hi Community,

I would like to understand if there is already a way to combine two small
files, without having to read and write both the files into a single file.

This will save a lot of time when we have too many small files to combine
into one single file.

Is it the internal metadata and structure that prevents files from being
combined this way, or is there some other reason?

Regards.

Re: Can multiple small ORC files be combined, without having to write all files into a big file?

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, I've added my opinions inline.

> I would like to understand if there is already a way to combine two small files

The `orc-tools` `convert` command already supports simple merging, as in the following:

    orc-tools convert 1.orc 2.orc -o merged.orc
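
When there are many small files, the same command line can be assembled
programmatically. The following is a minimal sketch, assuming `orc-tools` is
available on the PATH exactly as invoked above; the helper names
`build_convert_command` and `merge_orc_files` are hypothetical, and only the
`convert` subcommand and the `-o` option shown above are used.

```python
import subprocess


def build_convert_command(inputs, output, tool="orc-tools"):
    """Build the argument list for: orc-tools convert <inputs...> -o <output>."""
    inputs = [str(p) for p in inputs]
    if not inputs:
        raise ValueError("at least one input file is required")
    return [tool, "convert", *inputs, "-o", str(output)]


def merge_orc_files(inputs, output):
    """Run the command; raises CalledProcessError if the tool fails.

    Assumption: `orc-tools` is installed and on the PATH.
    """
    subprocess.run(build_convert_command(inputs, output), check=True)


# Only print the command here; call merge_orc_files() to actually run it.
print(" ".join(build_convert_command(["1.orc", "2.orc"], "merged.orc")))
```

Passing the arguments as a list avoids shell quoting problems with unusual
file names.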

> without having to read and write both the files into a single file.

At a minimum, you need to open the source files and read their bytes at
the stripe level. You may have left out a specific requirement in
this question.

> This will save a lot of time when we have too many small files to combine into one single file.

Let me assume that you are aiming for `Stripe`-level concatenation instead of
reading columns or records.
With many small files, the claim above is the usual first assumption, and
it is true in some cases.
However, in my experience, it can be badly wrong when there are many tiny ORC
files, because stripes are the unit of compression and processing:
concatenation keeps every tiny stripe as it is.
A single gigantic ORC file produced by simple merging may not be the best
result you can get.
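
The effect of a small compression unit can be illustrated with a toy model.
This is only a sketch using Python's `zlib`, not ORC's actual format or
codec path: it compresses the same rows either as many small independent
units (standing in for tiny stripes kept as-is by concatenation) or as a few
large units (standing in for stripes rewritten by compaction). The data and
the helper name `compressed_size_per_chunk` are made up for illustration.

```python
import zlib


def compressed_size_per_chunk(rows, rows_per_chunk):
    """Compress each chunk of rows independently and return the total bytes."""
    total = 0
    for start in range(0, len(rows), rows_per_chunk):
        chunk = "\n".join(rows[start:start + rows_per_chunk]).encode("utf-8")
        total += len(zlib.compress(chunk))
    return total


# Hypothetical, highly repetitive data: 20,000 log-like rows.
rows = [f"user_{i % 50},page_{i % 20},status_ok" for i in range(20_000)]

# 200 small units of 100 rows each versus 2 large units of 10,000 rows each.
tiny_stripes = compressed_size_per_chunk(rows, rows_per_chunk=100)
large_stripes = compressed_size_per_chunk(rows, rows_per_chunk=10_000)

print(f"200 tiny units: {tiny_stripes} bytes")
print(f"2 large units : {large_stripes} bytes")
```

Each small unit has to rediscover the same repetition on its own, so the
total is larger. Real ORC files add per-stripe indexes, dictionaries, and
footers on top of this, so actual ratios need to be measured on real data.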

I'd recommend trying both the compaction style (read/sort/write) and
simple stripe-level concatenation first.
After a real experiment, you can choose based on data about:
- The characteristics of your ORC data content
- The file size reduction ratio after processing
  (not only from the columnar encoding benefit; you may also want to
  switch the compression codec)
- Your downstream consumers' usage pattern (access frequency and input
  split handling)
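
The file size reduction ratio from the list above can be computed in a few
lines once both experiments have produced an output file. This is a minimal
sketch; the function name `size_reduction_ratio` is hypothetical and the byte
counts are placeholders, not measurements. In practice the sizes would come
from something like `os.path.getsize` on the real files.

```python
def size_reduction_ratio(input_sizes, output_size):
    """Fraction of bytes saved: 0.0 means no change, 0.5 means half the size."""
    total_in = sum(input_sizes)
    if total_in <= 0:
        raise ValueError("input sizes must sum to a positive number")
    return 1.0 - output_size / total_in


# Placeholder numbers for illustration only (not real measurements).
small_files = [1_200_000] * 50  # fifty small inputs, 60 MB in total
candidates = {
    "stripe-level concatenation": 59_000_000,
    "read/sort/write compaction": 41_000_000,
}

for name, out_size in candidates.items():
    ratio = size_reduction_ratio(small_files, out_size)
    print(f"{name}: {ratio:.1%} smaller")
```

Size is only one of the three criteria; the access pattern of downstream
consumers still has to be weighed separately.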

BTW, these days, you may want to consider Apache Iceberg, which provides
more features on top of ORC's file-level features.

Dongjoon.
