Posted to user@hadoop.apache.org by Anthony Mattas <an...@mattas.net> on 2014/03/05 03:31:42 UTC

Benchmarking Hive Changes

I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 using the HDP Sandbox. 

I took one of their example queries and executed it with the tables stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and predicate pushdown.

SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07 s07 JOIN sample_08 s08
ON ( s07.code = s08.code)
WHERE
 s07.salary < s08.salary
SORT BY s08.salary-s07.salary DESC

Ultimately there was not much difference in performance between any of the executions. Can someone clarify whether I need an actual full cluster to see performance improvements, or whether I’m missing something else? I thought at minimum I would have seen an improvement moving from TEXTFILE to ORC.
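
For reference, the setup described above might look roughly like the following HiveQL. This is a sketch under assumptions, not the exact commands used: the table names sample_07_orc and sample_08_orc are hypothetical, and whether each SET property has any effect depends on the Hive build (hive.vectorized.execution.enabled, for example, is only honored by builds that ship vectorization).

```sql
-- Hypothetical ORC copies of the sandbox sample tables (names are assumptions).
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Vectorized execution (requires a Hive build that supports it).
SET hive.vectorized.execution.enabled=true;
-- Predicate pushdown in the query optimizer.
SET hive.optimize.ppd=true;
-- Push eligible filters down to the ORC reader.
SET hive.optimize.index.filter=true;

-- Same query as above, pointed at the ORC copies.
SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07_orc s07 JOIN sample_08_orc s08
ON ( s07.code = s08.code)
WHERE
  s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

Note that the only filter in this query compares columns from two different tables, so even on large data there is little for the ORC reader to skip; a filter against a constant would exercise predicate pushdown more directly.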

Re: Benchmarking Hive Changes

Posted by Olivier Renault <or...@hortonworks.com>.

The last iteration of Stinger comes with Tez.

The HDP 2 sandbox that you're using does not include Tez. You can add it
manually if you would like (documentation is available on Hortonworks.com/labs),
or it will be available in the HDP 2.1 sandbox.

Kind regards
Olivier
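
Once Tez is installed, switching a Hive session between engines is typically a one-line setting. A minimal sketch, assuming a Hive build with Tez support and a working Tez installation on the cluster:

```sql
-- Assumes Hive was built with Tez support and Tez is installed on the cluster.
SET hive.execution.engine=tez;

-- Switch back to classic MapReduce to compare timings for the same query.
SET hive.execution.engine=mr;
```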
On 5 Mar 2014 17:15, "Anthony Mattas" <an...@mattas.net> wrote:

> Hi Yong,
>
> I'm confused - I'm using Hive 0.12.0, shouldn't that be using "Stinger" by
> default? Or are there configurations that have to be enabled?
>
> Anthony Mattas
> anthony@mattas.net


Re: Benchmarking Hive Changes

Posted by Anthony Mattas <an...@mattas.net>.

Hi Yong,

I'm confused - I'm using Hive 0.12.0, shouldn't that be using "Stinger" by
default? Or are there configurations that have to be enabled?

Anthony Mattas
anthony@mattas.net


On Wed, Mar 5, 2014 at 11:06 AM, java8964 <ja...@hotmail.com> wrote:

> Your files are too small for any meaningful test of these 3 file types.
>
> Most of the 23 seconds is spent on preparing/starting your MR job and
> shutting it down.
>
> You need at least gigabytes of data to compare the performance of these 3
> types and get any meaningful result.
>
> But as long as it is Hive on top of MapReduce, it will be really hard to
> achieve an "interactive" result. MapReduce is a batch engine, period.
>
> You do want to consider Impala/Spark or Stinger, if you really are
> looking for "interactive".
>
> Yong

RE: Benchmarking Hive Changes

Posted by java8964 <ja...@hotmail.com>.

Your files are too small for any meaningful test of these 3 file types.
Most of the 23 seconds is spent on preparing/starting your MR job and shutting it down.
You need at least gigabytes of data to compare the performance of these 3 types and get any meaningful result.
But as long as it is Hive on top of MapReduce, it will be really hard to achieve an "interactive" result. MapReduce is a batch engine, period.
You do want to consider Impala/Spark or Stinger, if you really are looking for "interactive".
Yong
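
For inputs this small, one setting that can avoid most of the job submission overhead is Hive's automatic local mode, which runs sufficiently small jobs inside the client process instead of submitting them to the cluster. A minimal sketch, assuming these properties are available in the Hive version in use; the threshold value is illustrative only:

```sql
-- Let Hive decide per query whether the input is small enough to run locally.
SET hive.exec.mode.local.auto=true;

-- Upper bound on total input size for local mode (illustrative value, in bytes).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
```

This only helps with toy data sets; it says nothing about how the file formats compare at realistic sizes.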

Date: Wed, 5 Mar 2014 09:02:32 -0500
Subject: Re: Benchmarking Hive Changes
From: anthony@mattas.net
To: user@hadoop.apache.org

Yes, I'm using the Hortonworks Data Platform 2.0 Sandbox, which is a standalone box.

But shame on me, it looks like the files are both very tiny (46K). I'm seeing about 23 seconds per query, which appears to be mostly MR startup.

So I'm going to find a new data set and try again. Are there any optimizations that can be done to reduce the startup time?

Ultimately I'm trying to compare the response time in Hive versus an EDW platform. Of course I still expect the EDW to perform better, but with the advancements in the newer versions of Hive I'm hoping for at least a reasonable response for a user wishing to do interactive querying. I specifically want to use Hive; I know you can get really good performance out of Impala, but I am not yet interested in going that route.
Anthony Mattas
anthony@mattas.net

On Wed, Mar 5, 2014 at 8:47 AM, java8964 <ja...@hotmail.com> wrote:

Are you running this on a single standalone box? How large are your test files, and how long did the jobs for each type take?
Yong


Re: Benchmarking Hive Changes

Posted by Anthony Mattas <an...@mattas.net>.

Yes, I'm using the Hortonworks Data Platform 2.0 Sandbox, which is a
standalone box.

But shame on me: it looks like the files are both very tiny (46K). I'm
seeing about 23 seconds per query, which appears to be mostly MR
start-up.

So I'm going to find a new data set and try again. Are there any
optimizations that can be done to reduce the start-up time?
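
On the start-up question, one option that may be worth trying is Hive's automatic local mode, which runs sufficiently small jobs on the client machine instead of submitting them to the cluster. This is a minimal sketch under the assumption that the session defaults have not been changed; nothing in this thread establishes whether it helps on the Sandbox.

```sql
-- Let Hive decide per query whether the job is small enough to run locally.
SET hive.exec.mode.local.auto=true;

-- Upper bound on total input size, in bytes, for a job to qualify for
-- local mode. 134217728 (128 MB) is the documented default.
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
```

This only affects small inputs such as the 46K sample tables. It does not make a fair benchmark of larger data, where the job would be submitted to the cluster as usual.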

Ultimately I'm trying to compare the response time in Hive versus an EDW
platform. Of course I still expect the EDW to perform better, but with the
advancements in the newer versions of Hive I'm hoping for at least a
reasonable response for a user wishing to do interactive querying,
specifically using Hive. I know you can get really good performance out of
Impala, but I am not yet interested in going that route.

Anthony Mattas
anthony@mattas.net

On Wed, Mar 5, 2014 at 8:47 AM, java8964 <ja...@hotmail.com> wrote:

> Are you running this on a single standalone box? How large are your test
> files, and how long did the jobs for each file type take?
>
> Yong
>
> > From: anthony@mattas.net
> > Subject: Benchmarking Hive Changes
> > Date: Tue, 4 Mar 2014 21:31:42 -0500
> > To: user@hadoop.apache.org
> >
> > I've been trying to benchmark some of the Hive enhancements in Hadoop
> > 2.0 using the HDP Sandbox.
> >
> > I took one of their example queries and executed it with the tables
> > stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized
> > execution and predicate pushdown.
> >
> > SELECT s07.description, s07.salary, s08.salary,
> > s08.salary - s07.salary
> > FROM
> > sample_07 s07 JOIN sample_08 s08
> > ON ( s07.code = s08.code)
> > WHERE
> > s07.salary < s08.salary
> > SORT BY s08.salary-s07.salary DESC
> >
> > Ultimately there was not much difference in performance between any of
> > the executions. Can someone clarify for me whether I need an actual full
> > cluster to see performance improvements, or whether I'm missing something
> > else? I thought that at minimum I would have seen an improvement moving
> > from TEXTFILE to ORC.
>

RE: Benchmarking Hive Changes

Posted by java8964 <ja...@hotmail.com>.

Are you running this on a single standalone box? How large are your test files, and how long did the jobs for each file type take?
Yong

> From: anthony@mattas.net
> Subject: Benchmarking Hive Changes
> Date: Tue, 4 Mar 2014 21:31:42 -0500
> To: user@hadoop.apache.org
> 
> I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 using the HDP Sandbox. 
> 
> I took one of their example queries and executed it with the tables stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and predicate pushdown.
> 
> SELECT s07.description, s07.salary, s08.salary,
>   s08.salary - s07.salary
> FROM
>   sample_07 s07 JOIN sample_08 s08
> ON ( s07.code = s08.code)
> WHERE
>  s07.salary < s08.salary
> SORT BY s08.salary-s07.salary DESC
> 
> Ultimately there was not much difference in performance between any of the executions. Can someone clarify for me whether I need an actual full cluster to see performance improvements, or whether I’m missing something else? I thought that at minimum I would have seen an improvement moving from TEXTFILE to ORC.
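
For reference, the vectorized execution and predicate pushdown mentioned in the quoted post are normally switched on per session with properties like the ones below. The thread does not say which settings were actually used, so this is a sketch under that assumption; whether each property is recognized also depends on the Hive version installed.

```sql
-- Vectorized execution; at the time of this thread it applied only to
-- ORC-backed tables.
SET hive.vectorized.execution.enabled=true;

-- Predicate pushdown in the query plan.
SET hive.optimize.ppd=true;

-- Push filters down to the storage layer so ORC can skip data that
-- cannot match.
SET hive.optimize.index.filter=true;

-- Inspect the plan rather than relying on timings alone.
EXPLAIN
SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07 s07 JOIN sample_08 s08
ON (s07.code = s08.code)
WHERE
  s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

Comparing the EXPLAIN output with and without these settings is a more reliable check than wall-clock time on tables this small.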
