Posted to user@orc.apache.org by Ben Johnson <be...@timber.io> on 2017/01/19 23:22:42 UTC
What is the proper way to merge small ORC files?
Hi, I'm new to this group, thank you for the help in advance. I had a question
about merging ORC files via a CONCATENATE query like so:
ALTER TABLE my_table [PARTITION partition_spec] CONCATENATE;
We have thousands of small ORC files whose data is ordered by a date attribute.
My questions are:
1. If we merge these files using the above query, will the ordering be
preserved?
2. And will the file be structured in a way that it will take full advantage of
the indexes?
3. Or are we better served transitioning the data through a traditional INSERT
query from a second table?
Thanks!
Ben
Timber.io - Blog - Github - Twitter
Re: What is the proper way to merge small ORC files?
Posted by Ben Johnson <be...@timber.io>.
Great! This is exactly what I needed. Thank you for taking the time to respond.
Ben
In reply to Gopal Vijayaraghavan (gopalv@apache.org), Thu, Jan 19, 2017 5:57 PM.
Re: What is the proper way to merge small ORC files?
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> 1. If we merge these files using the above query, will the ordering be preserved?
The merge does not re-compress; it concatenates the files (literally) and writes a new footer index.
So most characteristics of the original ordering will be maintained within each stripe, though the file-level indexes will change.
> 2. And will the file be structured in a way that it will take full advantage of the indexes?
Yes, well, as much as the original files did.
The row-group index is not affected at all and will work exactly as before, and the same goes for stripe elimination via predicate pushdown (PPD).
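To make the two answers above concrete, here is a toy Python model. It is not the ORC reader or Hive's merge code, only a sketch under stated assumptions: each stripe is assumed to carry min/max statistics for a date column, and the names Stripe, concatenate, and stripes_to_read are made up for illustration. It shows that appending stripes unchanged keeps each stripe's own date range, so a date predicate can still skip stripes after the merge.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stripe:
    """Toy stand-in for an ORC stripe: date values plus min/max stats on them."""
    rows: tuple

    @property
    def min_date(self) -> str:
        return min(self.rows)

    @property
    def max_date(self) -> str:
        return max(self.rows)


def concatenate(files: List[List[Stripe]]) -> List[Stripe]:
    """Model of a stripe-level merge: stripes are appended unchanged, in file order."""
    merged = []
    for stripes in files:
        merged.extend(stripes)
    return merged


def stripes_to_read(stripes: List[Stripe], lo: str, hi: str) -> List[int]:
    """Indexes of stripes whose [min, max] range overlaps the predicate [lo, hi]."""
    return [i for i, s in enumerate(stripes)
            if s.max_date >= lo and s.min_date <= hi]


# Three hypothetical small files, one stripe each, already ordered by date.
# ISO-8601 date strings compare correctly as plain strings.
small_files = [
    [Stripe(("2017-01-01", "2017-01-02"))],
    [Stripe(("2017-01-03", "2017-01-04"))],
    [Stripe(("2017-01-05", "2017-01-06"))],
]

merged = concatenate(small_files)
selected = stripes_to_read(merged, "2017-01-03", "2017-01-04")
print(len(merged), selected)
```

In this model the merged file still has three stripes, and only the middle one overlaps the predicate, which mirrors the point that stripe-level statistics survive the merge while the file-level ones are rewritten.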
> 3. Or are we better served transitioning the data through a traditional INSERT query from a second table?
Not always, particularly if you're already getting good performance right now.
You can get much better compression with a reinsert, but that might be a disadvantage in some cases (since a compressed stripe forms an indivisible split, reducing the number of stripes might reduce the number of splits).
More compression == fewer CPUs to process the same data. So that's a trade-off.
Cheers,
Gopal
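The trade-off in the answer to question 3 can be sketched with a small calculation. This is a simplification, not how Hive actually computes splits: it only assumes that a stripe cannot be divided further, so the number of stripes caps the number of tasks that can read a file at once. The function name and the numbers are hypothetical.

```python
def max_parallel_tasks(num_stripes: int, available_slots: int) -> int:
    """Upper bound on concurrent readers when a stripe cannot be split further."""
    return min(num_stripes, available_slots)


# Hypothetical numbers: the same data as 40 small stripes (after a concatenate)
# versus 8 larger, better-compressed stripes (after a reinsert), on 16 task slots.
before = max_parallel_tasks(num_stripes=40, available_slots=16)
after = max_parallel_tasks(num_stripes=8, available_slots=16)
print(before, after)
```

With these made-up numbers the reinserted layout can keep at most 8 of the 16 slots busy, which is the sense in which better compression may cost parallelism.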