Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2013/11/13 00:15:51 UTC

ORC Tuning - Examples?

I am looking for guidance (read: examples) on tuning ORC settings for my
data.  I see the documentation that shows the defaults, as well as a brief
description of what each setting is.  What I am looking for is some examples
of things to try.  *Note: I understand that nobody wants to make sweeping
declarations like "set this setting" without knowing the data.*  That said,
I would love to see some examples, specifically around:

orc.row.index.stride

orc.compress.size

orc.stripe.size


For example, I'd love to see some statements like:


If your data has lots of columns of small data, and you'd like better x,
try changing y setting because this allows hive to do z when querying.


If your data has few columns of large data, try changing y and this allows
hive to do z while querying.


It would be really neat to see some examples so we can get in and tune our
data. Right now, everything is a crapshoot for me, and I don't know if
there are detrimental effects that may make themselves known later.


Any input would be welcome.

Re: ORC Tuning - Examples?

Posted by Yin Huai <hu...@gmail.com>.

Hi John,

I have not played with the stride length. Based on my understanding of the
code, the stride length determines the number of rows between index
entries, so if you decrease it, you get finer-grained indexes, which can
potentially help you skip more unnecessary rows (with predicate pushdown).
I think the benefit of a smaller stride length is pretty workload
dependent. For example, suppose you have a query like this:

SELECT c1 FROM tbl WHERE c1 > 10 AND c1 < 20;

If tbl is sorted by c1, you may observe that less data is read from HDFS
when you decrease the stride length. However, if tbl is not sorted by c1,
in the worst case, even if you set the stride length to its minimum value
(i.e. 1000), you will still see that the entire c1 column is loaded from
HDFS.
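
To make the sorted vs. unsorted point concrete, here is a rough Python
sketch. It is only a toy model, not how the ORC reader is implemented: it
assumes each index entry keeps just the min and max of c1 for one group of
`stride` rows, and that a group is skipped only when that range cannot
match the predicate. The function name `rows_read` and the sample data are
made up for illustration.

```python
def rows_read(values, stride, lo, hi):
    """Rows scanned for `lo < c1 < hi` when whole row groups are skipped by min/max."""
    total = 0
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        # Toy stand-in for one index entry: min and max of c1 over `stride` rows.
        group_min, group_max = min(group), max(group)
        # A group is read only if its [min, max] range can contain a match.
        if group_max > lo and group_min < hi:
            total += len(group)
    return total


# Hypothetical column of 100,000 rows holding the values 0..99.
unsorted_c1 = [i % 100 for i in range(100_000)]  # every group spans 0..99
sorted_c1 = sorted(unsorted_c1)                  # 1,000 copies of each value, in order

for stride in (10_000, 1_000):
    print(stride,
          rows_read(sorted_c1, stride, 10, 20),
          rows_read(unsorted_c1, stride, 10, 20))
```

In this model the sorted column goes from 10,000 rows scanned at a stride
of 10,000 to 9,000 rows at a stride of 1,000, while the unsorted column is
scanned in full (100,000 rows) at both strides, because every group
contains values on both sides of the predicate range.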

Yin


On Wed, Nov 13, 2013 at 4:44 PM, John Omernik <jo...@omernik.com> wrote:

> Yin -
>
> Fantastic! That is exactly the type of explanation of settings I'd like to
> see. More than just what it does, but the tradeoffs, and how things are
> applied in the real world.  Have you played with the stride length at all?

Re: ORC Tuning - Examples?

Posted by John Omernik <jo...@omernik.com>.

Yin -

Fantastic! That is exactly the type of explanation of settings I'd like to
see. More than just what it does, but the tradeoffs, and how things are
applied in the real world.  Have you played with the stride length at all?


On Wed, Nov 13, 2013 at 1:13 PM, Yin Huai <hu...@gmail.com> wrote:

> Hi John,
>
> Here is my experience on the stripe size. For a given table, when the
> stripe size is increased, the size of a column in a stripe increases, which
> means the ORC reader can read a column from disks in a more efficient way
> because the reader can sequentially read more data (assuming the reader and
> the HDFS block are co-located). But, a larger stripe size may decrease the
> number of concurrent Map tasks reading an ORC file because a Map task needs
> to process at least one stripe (seems a stripe is not splittable right now).
> If you can get enough degree of parallelism, I think increasing the stripe
> size generally gives you better data reading efficiency in one task.
> However, on HDDs, the benefit from increasing the stripe size on data
> reading efficiency in a Map task is getting smaller with the increase of
> the stripe size. So, for a table with only a few columns (assuming a single
> ORC file is used), using a smaller stripe size may not significantly affect
> data reading efficiency in a task, and you can potentially have more
> concurrent tasks to read this ORC file. So, I think you need to trade off
> the data reading efficiency in a single task (larger stripe size -> better
> data reading efficiency in a task) and the degree of parallelism (smaller
> stripe size -> more concurrent tasks to read an ORC file) when determining
> the right stripe size.
>
> btw, I have a paper studying file formats and it has some related
> contents. Here is the link:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf
> .
>
> Thanks,
>
> Yin

Re: ORC Tuning - Examples?

Posted by Yin Huai <hu...@gmail.com>.

Hi John,

Here is my experience with the stripe size. For a given table, when the
stripe size is increased, the size of a column in a stripe increases, which
means the ORC reader can read a column from disk more efficiently because
it can sequentially read more data (assuming the reader and the HDFS block
are co-located). But a larger stripe size may decrease the number of
concurrent Map tasks reading an ORC file, because a Map task needs to
process at least one stripe (it seems a stripe is not splittable right now).
If you can get a sufficient degree of parallelism, I think increasing the
stripe size generally gives you better data reading efficiency in one task.
However, on HDDs, the benefit of a larger stripe size for data reading
efficiency in a Map task diminishes as the stripe size grows. So, for a
table with only a few columns (assuming a single ORC file is used), using a
smaller stripe size may not significantly affect data reading efficiency in
a task, and you can potentially have more concurrent tasks reading this ORC
file. So, I think you need to trade off the data reading efficiency in a
single task (larger stripe size -> better data reading efficiency in a
task) against the degree of parallelism (smaller stripe size -> more
concurrent tasks to read an ORC file) when determining the right stripe
size.
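
As a back-of-the-envelope illustration of that tradeoff, here is a small
Python sketch. It rests on simplifying assumptions that are mine, not
measurements: stripes are filled to exactly the configured size, a task
reads at least one whole stripe, and all columns take an equal share of a
stripe. The helper names and the 1 GB / 4-column figures are hypothetical.

```python
MB = 1024 * 1024


def max_concurrent_tasks(file_size_bytes, stripe_size_bytes):
    """Upper bound on tasks reading one file if each task needs at least one stripe."""
    # Ceiling division: a partially filled last stripe still counts as a stripe.
    return -(-file_size_bytes // stripe_size_bytes)


def column_bytes_per_stripe(stripe_size_bytes, num_columns):
    """Approximate sequential read per column per stripe, assuming equal-size columns."""
    return stripe_size_bytes // num_columns


file_size = 1024 * MB  # hypothetical 1 GB ORC file
for stripe_size in (256 * MB, 64 * MB):
    print(stripe_size // MB,
          max_concurrent_tasks(file_size, stripe_size),
          column_bytes_per_stripe(stripe_size, 4) // MB)
```

Under these assumptions a 256 MB stripe allows at most 4 tasks on the file,
each reading about 64 MB per column sequentially, while a 64 MB stripe
allows up to 16 tasks reading about 16 MB per column. With many more
columns the per-column chunk shrinks accordingly, which is where a larger
stripe size helps most.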

btw, I have a paper studying file formats that has some related content.
Here is the link:
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf

Thanks,

Yin

On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz <le...@gmail.com> wrote:

> If you get some useful advice, let's improve the doc.
>
> -- Lefty

Re: ORC Tuning - Examples?

Posted by Lefty Leverenz <le...@gmail.com>.

If you get some useful advice, let's improve the doc.

-- Lefty


On Tue, Nov 12, 2013 at 6:15 PM, John Omernik <jo...@omernik.com> wrote:

> I am looking for guidance (read examples) on tuning ORC settings for my
> data.  I see the documentation that shows the defaults, as well as a brief
> description of what it is.  What I am looking for is some examples of
> things to try.  *Note: I understand that nobody wants to make sweeping
> declarations like "set this setting" without knowing the data.*  That said, I would
> love to see some examples, specifically around:
>
> orc.row.index.stride
>
> orc.compress.size
>
> orc.stripe.size
>
>
> For example, I'd love to see some statements like:
>
>
> If your data has lots of columns of small data, and you'd like better x,
> try changing y setting because this allows hive to do z when querying.
>
>
> If your data has few columns of large data, try changing y and this allows
> hive to do z while querying.
>
>
> It would be really neat to see some examples so we can get in and tune our
> data. Right now, everything is a crapshoot for me, and I don't know if
> there are detrimental effects that may make themselves known later.
>
>
> Any input would be welcome.
>