Posted to user@hive.apache.org by Leo Alekseyev <dn...@gmail.com> on 2010/08/14 05:16:47 UTC

Why two map stages for a simple select query?

Hi all,
I'm mystified by Hive's behavior for two types of queries.

1: consider the following simple select query:
insert overwrite table alogs_test_extracted1
select raw.client_ip, raw.cookie, raw.referrer_flag
from alogs_test_rc6 raw;
Both tables are stored as rcfiles, and LZO compression is turned on.

Hive runs this in two jobs: a map-only, and a map-reduce.  Question:
can someone explain to me _what_ hive is doing in the two map jobs?..
I stared at the output of EXPLAIN, but can't figure out what is going
on.  When I do similar extractions by hand, I have a mapper that pulls
out fields from records, and (optionally) a reducer that combines the
results -- that is, one map stage.  Why are there two here?..  (about
30% of the time is spent on the first map stage, 45% on the second map
stage, and 25% on the reduce step).

2: consider the "transform..using" query below:
insert overwrite table alogs_test_rc6
select
  transform (d.ll)
    using 'java myProcessingClass'
    as (field1, field2, field3)
from (select logline as ll from raw_log_test1day) d;

Here, Hive's plan (as shown via EXPLAIN) also suggests two MR stages: a
map, and a map-reduce.  However, when the job actually runs, Hive says
"Launching job 1 out of 2", runs the transform script in mappers,
writes the table, and never launches job 2 (the map-reduce stage in
the plan)!  Why is this happening, and can I control this behavior?..
Sometimes it would be preferable for me to run a map-only job (perhaps
combining input data for mappers with CombineFileInputFormat to avoid
generating thousands of 20MB files).

Thanks in advance to anyone who can clarify Hive's behavior here...
--Leo

Re: Why two map stages for a simple select query?

Posted by Ning Zhang <nz...@facebook.com>.
We have a plan to migrate to the new mapreduce API, but probably not very soon. 

On Aug 18, 2010, at 1:13 AM, Leo Alekseyev wrote:

> So is there any plan to migrate to the new API, or are you happy with
> using the deprecated (as of 0.21) API, provided that it's backwards
> compatible?..


Re: Why two map stages for a simple select query?

Posted by Leo Alekseyev <dn...@gmail.com>.
> Using CombineHiveInputFormat in a map-only job to merge small files is a good idea. Actually this is what HIVE-1307 will do. I'm not aware of the signature difference in Cloudera's Hadoop distribution. The Hive's createPool() signature is compatible with Hadoop 0.20.2 API and the future HIVE-1307 patch should also stay with the same API. So you may want to ask on the Cloudera forum to see if it can be supported.

Cloudera deprecated
org.apache.hadoop.mapred.lib.CombineFileInputFormat (see
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html)
and made it inherit from the (non-deprecated)
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat, which
has different method signatures because the new API doesn't use
JobConf.  Apache Hadoop 0.20.2 does _not_ implement
CombineFileInputFormat for the new API, but 0.21 does.

Note that in addition, Hadoop 0.21 explicitly deprecates the old
org.apache.hadoop.mapred.lib.CombineFileInputFormat, and makes it
inherit from org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.
However, unlike Cloudera, Apache provides wrappers to preserve method
signatures.

So is there any plan to migrate to the new API, or are you happy with
using the deprecated (as of 0.21) API, provided that it's backwards
compatible?..





Re: Why two map stages for a simple select query?

Posted by Ning Zhang <nz...@facebook.com>.
On Aug 13, 2010, at 9:52 PM, Leo Alekseyev wrote:

> Ning, thanks -- I can indeed force a map-only task with
> hive.merge.mapfiles=false.  However, I'm still curious what triggers
> whether or not the merge MR job is run?..  In my original message I
> gave two sample queries; I believe hive.merge.mapfiles was set to true
> for both of them.  But for the first one, the merge MR job ran, while
> for the second, Hive only ran the first map stage and then printed
> something like "Ended Job = 590224440, job is filtered out (removed at
> runtime)".
> 
Whether a merge MR job is triggered is determined by two runtime conditions: 1) whether the first job produced more than one file, and 2) whether the average size of those files is less than hive.merge.smallfiles.avgsize (default = 16MB). Because these are runtime conditions, the merge job is filtered out if either condition is false.
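To make this concrete, here is an illustrative session sketch (the threshold value is just an example, not a recommendation): with ~20MB output files, the average size exceeds the 16MB default, so condition 2 fails and the merge job is filtered out at runtime; raising the threshold would let it run.

```sql
-- By default, files averaging over 16MB are not considered "small",
-- so the conditional merge job gets filtered out at runtime.
-- Raising the threshold makes ~20MB files eligible for merging:
SET hive.merge.smallfiles.avgsize=64000000;  -- ~64MB, illustrative value only
```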


> Also, would you recommend CombineFileInputFormat with map-only jobs to
> better control the number of output chunks?..  Right now I seem
> to have a choice between having 10,000 20MB files or merging into larger
> files at the cost of roughly 3x compute time in the merge MR job.  (As a
> side note, CombineFileInputFormat doesn't work with Cloudera's Hadoop
> 0.20.1 due to some different method signatures in createPool(...), so
> I want to make sure it's worth getting it to work before I start
> making major changes to our deployment.)
> 
Using CombineHiveInputFormat in a map-only job to merge small files is a good idea. Actually this is what HIVE-1307 will do. I'm not aware of the signature difference in Cloudera's Hadoop distribution. Hive's createPool() signature is compatible with the Hadoop 0.20.2 API, and the future HIVE-1307 patch should also stay with the same API. So you may want to ask on the Cloudera forum to see if it can be supported.
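For reference, a sketch of how this would be enabled, assuming a Hive build where CombineHiveInputFormat is available (e.g. once HIVE-1307 lands); this asks Hive to pack many small input files into fewer map splits:

```sql
-- Sketch, assuming CombineHiveInputFormat ships with your Hive build:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```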



Re: Why two map stages for a simple select query?

Posted by Leo Alekseyev <dn...@gmail.com>.
Ning, thanks -- I can indeed force a map-only task with
hive.merge.mapfiles=false.  However, I'm still curious what triggers
whether or not the merge MR job is run?..  In my original message I
gave two sample queries; I believe hive.merge.mapfiles was set to true
for both of them.  But for the first one, the merge MR job ran, while
for the second, Hive only ran the first map stage and then printed
something like "Ended Job = 590224440, job is filtered out (removed at
runtime)".

Also, would you recommend CombineFileInputFormat with map-only jobs to
better control the number of output chunks?..  Right now I seem
to have a choice between having 10,000 20MB files or merging into larger
files at the cost of roughly 3x compute time in the merge MR job.  (As a
side note, CombineFileInputFormat doesn't work with Cloudera's Hadoop
0.20.1 due to some different method signatures in createPool(...), so
I want to make sure it's worth getting it to work before I start
making major changes to our deployment.)

--Leo


Re: Why two map stages for a simple select query?

Posted by Ning Zhang <nz...@facebook.com>.
The second map-reduce job is probably the merge job, which takes the output of the first map-only job (the real query) and merges the resulting files. The merge job is not always triggered. If you look at the plan you may find it is a child of a conditional task, which means it is conditionally triggered based on the results of the first map-only job.

You can prevent the merge task from running by setting hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles controls whether to merge the results of a map-reduce job.
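For example, a minimal session sketch (reusing the query from the original message) that disables the conditional merge, so only the map-only job runs:

```sql
SET hive.merge.mapfiles=false;     -- don't merge outputs of map-only jobs
SET hive.merge.mapredfiles=false;  -- likewise for map-reduce jobs

INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;
```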
