Posted to user@orc.apache.org by Ben Johnson <be...@timber.io> on 2017/01/19 23:22:42 UTC

What is the proper way to merge small ORC files?

Hi, I'm new to this group; thank you in advance for the help. I have a question
about merging ORC files via a CONCATENATE query like so:

    ALTER TABLE my_table [PARTITION partition_spec] CONCATENATE;

We have thousands of small ORC files whose data is ordered by a date attribute.
My questions are:
 1. If we merge these files using the above query, will the ordering be
    preserved?
 2. And will the file be structured in a way that it will take full advantage of
    the indexes?
 3. Or are we better served transitioning the data through a traditional INSERT
    query from a second table?




Thanks!
Ben

Re: What is the proper way to merge small ORC files?

Posted by Ben Johnson <be...@timber.io>.
Great! This is exactly what I needed. Thank you for taking the time to respond.
Ben

On Thu, Jan 19, 2017 5:57 PM, Gopal Vijayaraghavan <gopalv@apache.org> wrote:

> > 1. If we merge these files using the above query, will the ordering be
> > preserved?
>
> The merge does not re-compress, but it will concatenate files (literally) and
> write a new footer index.
>
> So most characteristics of the original ordering will be maintained within
> each stripe, though the file-level indexes will change.
>
> > 2. And will the file be structured in a way that it will take full
> > advantage of the indexes?
>
> Yes, to the same extent as the original files.
>
> The row-group index is not affected at all and will work exactly as before,
> and the same goes for stripe elimination via PPD (predicate pushdown).
>
> > 3. Or are we better served transitioning the data through a traditional
> > INSERT query from a second table?
>
> Not always, particularly if you're already getting good performance.
>
> You can get much better compression with a reinsert, but that might be a
> disadvantage in some cases (since a compressed stripe forms an indivisible
> split, reducing the number of stripes might reduce the number of splits).
>
> More compression == fewer CPUs processing the same data in parallel, so
> that's a trade-off.
>
> Cheers,
> Gopal

Re: What is the proper way to merge small ORC files?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> 1. If we merge these files using the above query, will the ordering be preserved?

The merge does not re-compress, but it will concatenate files (literally) and write a new footer index.

So most characteristics of the original ordering will be maintained within each stripe, though the file-level indexes will change.
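
As a concrete illustration, the merge might be issued one partition at a time, as below. This is only a sketch: the partition column `ds` and its value are hypothetical and not taken from the thread, and `my_table` is the placeholder name from the original question.

```sql
-- Hypothetical partition spec; substitute the table's real partition columns.
ALTER TABLE my_table PARTITION (ds = '2017-01-19') CONCATENATE;

-- Optional check: when statistics are present, the partition parameters
-- (for example numFiles) should typically show the smaller file count.
DESCRIBE FORMATTED my_table PARTITION (ds = '2017-01-19');
```

Since the stripes are copied rather than rewritten, the rows inside each stripe stay in the order in which they were originally written.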

> 2. And will the file be structured in a way that it will take full advantage of the indexes?

Yes, to the same extent as the original files.

The row-group index is not affected at all and will work exactly as before, and the same goes for stripe elimination via PPD (predicate pushdown).
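
For the indexes to matter, a filter on the ordered column has to reach the ORC reader. A minimal sketch, assuming the date attribute is a column named `event_date` (a hypothetical name, not from the thread) and that the Hive version in use honors these settings:

```sql
-- Predicate pushdown at the query-planning level.
SET hive.optimize.ppd = true;
-- Lets the filter reach the ORC reader so that stripe and row-group
-- min/max statistics can be used to skip data.
SET hive.optimize.index.filter = true;

SELECT count(*)
FROM my_table
WHERE event_date >= '2017-01-01'
  AND event_date <  '2017-01-08';
```

Because the rows in each stripe are ordered by date, the min/max range recorded for each stripe and row group is narrow, so a range filter like this one can skip most of the data whether the stripes sit in thousands of small files or in one merged file.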

> 3. Or are we better served transitioning the data through a traditional INSERT query from a second table?

Not always, particularly if you're already getting good performance.

You can get much better compression with a reinsert, but that might be a disadvantage in some cases (since a compressed stripe forms an indivisible split, reducing the number of stripes might reduce the number of splits).

More compression == fewer CPUs processing the same data in parallel, so that's a trade-off.
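
For comparison, the re-insert route might look like the sketch below. It assumes an unpartitioned table and a date column named `event_date`; both that column name and the target table `my_table_compacted` are hypothetical and not from the thread.

```sql
-- New table with the same schema and storage format as the original.
CREATE TABLE my_table_compacted LIKE my_table;

-- Rewrites (and re-compresses) every row. SORT BY orders rows within
-- each reducer's output file rather than globally.
INSERT OVERWRITE TABLE my_table_compacted
SELECT *
FROM my_table
SORT BY event_date;
```

Unlike CONCATENATE, this decodes and re-encodes all of the data, which is where both the better compression and the extra cost come from.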

Cheers,
Gopal