You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Chen Song <ch...@gmail.com> on 2012/12/13 00:32:31 UTC

map side join with group by

I have a silly question on how Hive interpretes a simple query with both
map side join and group by.

Below query will translate into two jobs, with the 1st one as a map only
job doing the join and storing the output in a intermediary location, and
the 2nd one as a map-reduce job taking the output of the 1st job as input
and doing the group by.

SELECT
/*+ MAPJOIN(d) */
table.a, sum(table2.b)
from table
LEFT OUTER JOIN table2
ON table.id = table2.id
where hour = '2012-12-11 11'
group by table.a

Why can't this be done within a single map reduce job? As what I can see
from the query plan is that all 2nd job mapper do is taking the 1st job's
mapper output.

-- 
Chen Song

Re: map side join with group by

Posted by Chen Song <ch...@gmail.com>.

Thanks Nitin. This is all I want to clarify :)

Chen

On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar <ni...@gmail.com>wrote:

> to improve the speed of the job they created map only joins so that all
> the records associated with a key fall to a map .. reducers slows it down.
> If the reducer has to do some more job then they launch another job.
>
> bear in mind, when we say map only join we are absolutely sure that speed
> will increase in case data in one of the tables is in the few hundred MB
> ranges. If this has to do with reduce in hand, the processing logic
> completely changes and it also slows down.
>
> Launching a new job for group by is a neat way to measure how much time
> you spent on just join and another on group by so you can easily see two
> different things.
>
> There is no way you can ask a mapjoin to launch a reducer as it is not
> supposed to do.
>
> If you have such case (may be if you think that it will improve
> performance), please feel free to raise a jira and get it reviewed. if its
> valid I think people will provide more ideas
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <ch...@gmail.com>wrote:
>
>> Nitin
>>
>> Yeah. My original question is that is there a way to force Hive (or
>> rather to say, is it possible) to execute map side join at mapper phase and
>> group by in reduce phase. So instead of launching a map only job (join) and
>> map reduce job (group by), doing it altogether in a single MR job. This is
>> obviously not what Hive does but I am wondering if it is a nice feature to
>> have.
>>
>> The point you made (different keys in join and group by) only matters
>> when it is the time in reduce phase, right? As map side join takes care of
>> join at mapper phase, it sounds to me natural that group by can be done in
>> the reduce phase in the same job. The only hassle that I can think of is
>> that map output have to be resorted (based on group by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <ch...@gmail.com>wrote:
>>>
>>>> Understood that fact that it is impossible in the same MR job if both
>>>> join and group by are gonna happen in the reduce phase (because the join
>>>> keys and group by keys are different). But for map side join, the joins
>>>> would be complete by the end of the map phase, and outputs should be ready
>>>> to be distributed to reducers based on group by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> Thats because for the first job the join keys are different and second
>>>>> job group by keys are different, you just cant assume join keys and group
>>>>> keys will be same so they are two different jobs
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com>wrote:
>>>>>
>>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>>> point is that when a query has a map join and group by, even in its
>>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>
>>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>>> understood it correctly
>>>>>>>
>>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>>> after that to execute the group by part it launches the second job.
>>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>>> queries
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> grover.markgrover@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>>>> sense when the right table is the one being "mapped" (in other
>>>>>>>> words,
>>>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>>>> explanation.
>>>>>>>>
>>>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>>>> column on table1, that predicate would get pushed up (if you have
>>>>>>>> hive.optimize.ppd property set to true), so it could possibly be
>>>>>>>> done
>>>>>>>> in 1 MR job (I am not sure if that's presently the case, you will
>>>>>>>> have
>>>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>>>> column in the right table (table2 in your example), it can't be
>>>>>>>> pushed
>>>>>>>> up since a column of the right table can have different values
>>>>>>>> before
>>>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would
>>>>>>>> need
>>>>>>>> to be applied in a separate MR job.
>>>>>>>>
>>>>>>>> This is just my understanding, the full proof answer would lie in
>>>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>>>
>>>>>>>> And for completeness, there is a conditional task (starting Hive
>>>>>>>> 0.7)
>>>>>>>> that will convert your joins automatically to map joins where
>>>>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>>>>> property.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > I have a silly question on how Hive interpretes a simple query
>>>>>>>> with both map
>>>>>>>> > side join and group by.
>>>>>>>> >
>>>>>>>> > Below query will translate into two jobs, with the 1st one as a
>>>>>>>> map only job
>>>>>>>> > doing the join and storing the output in a intermediary location,
>>>>>>>> and the
>>>>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>>>>> input and
>>>>>>>> > doing the group by.
>>>>>>>> >
>>>>>>>> > SELECT
>>>>>>>> > /*+ MAPJOIN(d) */
>>>>>>>> > table.a, sum(table2.b)
>>>>>>>> > from table
>>>>>>>> > LEFT OUTER JOIN table2
>>>>>>>> > ON table.id = table2.id
>>>>>>>> > where hour = '2012-12-11 11'
>>>>>>>> > group by table.a
>>>>>>>> >
>>>>>>>> > Why can't this be done within a single map reduce job? As what I
>>>>>>>> can see
>>>>>>>> > from the query plan is that all 2nd job mapper do is taking the
>>>>>>>> 1st job's
>>>>>>>> > mapper output.
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Chen Song
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chen Song
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song

Re: map side join with group by

Posted by Nitin Pawar <ni...@gmail.com>.

to improve the speed of the job they created map only joins so that all the
records associated with a key fall to a map .. reducers slows it down. If
the reducer has to do some more job then they launch another job.

bear in mind, when we say map only join we are absolutely sure that speed
will increase in case data in one of the tables is in the few hundred MB
ranges. If this has to do with reduce in hand, the processing logic
completely changes and it also slows down.

Launching a new job for group by is a neat way to measure how much time you
spent on just join and another on group by so you can easily see two
different things.

There is no way you can ask a mapjoin to launch a reducer as it is not
supposed to do.

If you have such case (may be if you think that it will improve
performance), please feel free to raise a jira and get it reviewed. if its
valid I think people will provide more ideas


On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <ch...@gmail.com> wrote:

> Nitin
>
> Yeah. My original question is that is there a way to force Hive (or rather
> to say, is it possible) to execute map side join at mapper phase and group
> by in reduce phase. So instead of launching a map only job (join) and map
> reduce job (group by), doing it altogether in a single MR job. This is
> obviously not what Hive does but I am wondering if it is a nice feature to
> have.
>
> The point you made (different keys in join and group by) only matters when
> it is the time in reduce phase, right? As map side join takes care of join
> at mapper phase, it sounds to me natural that group by can be done in the
> reduce phase in the same job. The only hassle that I can think of is that
> map output have to be resorted (based on group by keys).
>
> Chen
>
> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>
>>
>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <ch...@gmail.com>wrote:
>>
>>> Understood that fact that it is impossible in the same MR job if both
>>> join and group by are gonna happen in the reduce phase (because the join
>>> keys and group by keys are different). But for map side join, the joins
>>> would be complete by the end of the map phase, and outputs should be ready
>>> to be distributed to reducers based on group by keys.
>>>
>>> Chen
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> Thats because for the first job the join keys are different and second
>>>> job group by keys are different, you just cant assume join keys and group
>>>> keys will be same so they are two different jobs
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com>wrote:
>>>>
>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>> point is that when a query has a map join and group by, even in its
>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>
>>>>> Best,
>>>>> Chen
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <nitinpawar432@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>> understood it correctly
>>>>>>
>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>> after that to execute the group by part it launches the second job.
>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>> queries
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>> grover.markgrover@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Chen,
>>>>>>> I think we would need some more information.
>>>>>>>
>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>>> explanation.
>>>>>>>
>>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>>> column on table1, that predicate would get pushed up (if you have
>>>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>>>> in 1 MR job (I am not sure if that's presently the case, you will
>>>>>>> have
>>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>>> column in the right table (table2 in your example), it can't be
>>>>>>> pushed
>>>>>>> up since a column of the right table can have different values before
>>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>>>> to be applied in a separate MR job.
>>>>>>>
>>>>>>> This is just my understanding, the full proof answer would lie in
>>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>>
>>>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>>>> that will convert your joins automatically to map joins where
>>>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>>>> property.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>>>>>> wrote:
>>>>>>> > I have a silly question on how Hive interpretes a simple query
>>>>>>> with both map
>>>>>>> > side join and group by.
>>>>>>> >
>>>>>>> > Below query will translate into two jobs, with the 1st one as a
>>>>>>> map only job
>>>>>>> > doing the join and storing the output in a intermediary location,
>>>>>>> and the
>>>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>>>> input and
>>>>>>> > doing the group by.
>>>>>>> >
>>>>>>> > SELECT
>>>>>>> > /*+ MAPJOIN(d) */
>>>>>>> > table.a, sum(table2.b)
>>>>>>> > from table
>>>>>>> > LEFT OUTER JOIN table2
>>>>>>> > ON table.id = table2.id
>>>>>>> > where hour = '2012-12-11 11'
>>>>>>> > group by table.a
>>>>>>> >
>>>>>>> > Why can't this be done within a single map reduce job? As what I
>>>>>>> can see
>>>>>>> > from the query plan is that all 2nd job mapper do is taking the
>>>>>>> 1st job's
>>>>>>> > mapper output.
>>>>>>> >
>>>>>>> > --
>>>>>>> > Chen Song
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chen Song
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Chen Song
>
>
>


-- 
Nitin Pawar

Re: map side join with group by

Posted by Chen Song <ch...@gmail.com>.

Nitin

Yeah. My original question is that is there a way to force Hive (or rather
to say, is it possible) to execute map side join at mapper phase and group
by in reduce phase. So instead of launching a map only job (join) and map
reduce job (group by), doing it altogether in a single MR job. This is
obviously not what Hive does but I am wondering if it is a nice feature to
have.

The point you made (different keys in join and group by) only matters when
it is the time in reduce phase, right? As map side join takes care of join
at mapper phase, it sounds to me natural that group by can be done in the
reduce phase in the same job. The only hassle that I can think of is that
map output have to be resorted (based on group by keys).

Chen

On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <ni...@gmail.com>wrote:

> chen in mapside join .. there are no reducers .. its MAP ONLY job
>
>
> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <ch...@gmail.com>wrote:
>
>> Understood that fact that it is impossible in the same MR job if both
>> join and group by are gonna happen in the reduce phase (because the join
>> keys and group by keys are different). But for map side join, the joins
>> would be complete by the end of the map phase, and outputs should be ready
>> to be distributed to reducers based on group by keys.
>>
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> Thats because for the first job the join keys are different and second
>>> job group by keys are different, you just cant assume join keys and group
>>> keys will be same so they are two different jobs
>>>
>>>
>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com>wrote:
>>>
>>>> Yeah, my abridged version of query might be a little broken but my
>>>> point is that when a query has a map join and group by, even in its
>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>> map join and group by cannot be accomplished in one MR job.
>>>>
>>>> Best,
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> I think Chen wanted to know why this is two phased query if I
>>>>> understood it correctly
>>>>>
>>>>> When you run a mapside join .. it just performs the join query ..
>>>>> after that to execute the group by part it launches the second job.
>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>> queries
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>> grover.markgrover@gmail.com> wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>> I think we would need some more information.
>>>>>>
>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>> explanation.
>>>>>>
>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>> column on table1, that predicate would get pushed up (if you have
>>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>>> up since a column of the right table can have different values before
>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>>> to be applied in a separate MR job.
>>>>>>
>>>>>> This is just my understanding, the full proof answer would lie in
>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>
>>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>>> that will convert your joins automatically to map joins where
>>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>>> property.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>>>>> wrote:
>>>>>> > I have a silly question on how Hive interpretes a simple query with
>>>>>> both map
>>>>>> > side join and group by.
>>>>>> >
>>>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>>>> only job
>>>>>> > doing the join and storing the output in a intermediary location,
>>>>>> and the
>>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>>> input and
>>>>>> > doing the group by.
>>>>>> >
>>>>>> > SELECT
>>>>>> > /*+ MAPJOIN(d) */
>>>>>> > table.a, sum(table2.b)
>>>>>> > from table
>>>>>> > LEFT OUTER JOIN table2
>>>>>> > ON table.id = table2.id
>>>>>> > where hour = '2012-12-11 11'
>>>>>> > group by table.a
>>>>>> >
>>>>>> > Why can't this be done within a single map reduce job? As what I
>>>>>> can see
>>>>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>>>>> job's
>>>>>> > mapper output.
>>>>>> >
>>>>>> > --
>>>>>> > Chen Song
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song

Re: map side join with group by

Posted by Nitin Pawar <ni...@gmail.com>.

chen in mapside join .. there are no reducers .. its MAP ONLY job


On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <ch...@gmail.com> wrote:

> Understood that fact that it is impossible in the same MR job if both join
> and group by are gonna happen in the reduce phase (because the join keys
> and group by keys are different). But for map side join, the joins would be
> complete by the end of the map phase, and outputs should be ready to be
> distributed to reducers based on group by keys.
>
> Chen
>
>
> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> Thats because for the first job the join keys are different and second
>> job group by keys are different, you just cant assume join keys and group
>> keys will be same so they are two different jobs
>>
>>
>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com>wrote:
>>
>>> Yeah, my abridged version of query might be a little broken but my point
>>> is that when a query has a map join and group by, even in its simplified
>>> incarnation, it will launch two jobs. I was just wondering why map join and
>>> group by cannot be accomplished in one MR job.
>>>
>>> Best,
>>> Chen
>>>
>>>
>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> I think Chen wanted to know why this is two phased query if I
>>>> understood it correctly
>>>>
>>>> When you run a mapside join .. it just performs the join query .. after
>>>> that to execute the group by part it launches the second job.
>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>> queries
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>> grover.markgrover@gmail.com> wrote:
>>>>
>>>>> Hi Chen,
>>>>> I think we would need some more information.
>>>>>
>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>> explanation.
>>>>>
>>>>> In your query, the where clause on a column called "hour", at this
>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>> column on table1, that predicate would get pushed up (if you have
>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>> to check the explain plan). If however, the where clause is on a
>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>> up since a column of the right table can have different values before
>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>> to be applied in a separate MR job.
>>>>>
>>>>> This is just my understanding, the full proof answer would lie in
>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>
>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>> that will convert your joins automatically to map joins where
>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>> property.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>>>> wrote:
>>>>> > I have a silly question on how Hive interpretes a simple query with
>>>>> both map
>>>>> > side join and group by.
>>>>> >
>>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>>> only job
>>>>> > doing the join and storing the output in a intermediary location,
>>>>> and the
>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>> input and
>>>>> > doing the group by.
>>>>> >
>>>>> > SELECT
>>>>> > /*+ MAPJOIN(d) */
>>>>> > table.a, sum(table2.b)
>>>>> > from table
>>>>> > LEFT OUTER JOIN table2
>>>>> > ON table.id = table2.id
>>>>> > where hour = '2012-12-11 11'
>>>>> > group by table.a
>>>>> >
>>>>> > Why can't this be done within a single map reduce job? As what I can
>>>>> see
>>>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>>>> job's
>>>>> > mapper output.
>>>>> >
>>>>> > --
>>>>> > Chen Song
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Chen Song
>
>
>


-- 
Nitin Pawar

Re: map side join with group by

Posted by Chen Song <ch...@gmail.com>.

Understood that fact that it is impossible in the same MR job if both join
and group by are gonna happen in the reduce phase (because the join keys
and group by keys are different). But for map side join, the joins would be
complete by the end of the map phase, and outputs should be ready to be
distributed to reducers based on group by keys.

Chen

On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <ni...@gmail.com>wrote:

> Thats because for the first job the join keys are different and second job
> group by keys are different, you just cant assume join keys and group keys
> will be same so they are two different jobs
>
>
> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com> wrote:
>
>> Yeah, my abridged version of query might be a little broken but my point
>> is that when a query has a map join and group by, even in its simplified
>> incarnation, it will launch two jobs. I was just wondering why map join and
>> group by cannot be accomplished in one MR job.
>>
>> Best,
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> I think Chen wanted to know why this is two phased query if I understood
>>> it correctly
>>>
>>> When you run a mapside join .. it just performs the join query .. after
>>> that to execute the group by part it launches the second job.
>>> I may be wrong but this is how I saw it whenever I executed group by
>>> queries
>>>
>>>
>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>> grover.markgrover@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>> I think we would need some more information.
>>>>
>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>> there is not such table in the query. Moreover, Map joins only make
>>>> sense when the right table is the one being "mapped" (in other words,
>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>> explanation.
>>>>
>>>> In your query, the where clause on a column called "hour", at this
>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>> column on table1, that predicate would get pushed up (if you have
>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>> to check the explain plan). If however, the where clause is on a
>>>> column in the right table (table2 in your example), it can't be pushed
>>>> up since a column of the right table can have different values before
>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>> to be applied in a separate MR job.
>>>>
>>>> This is just my understanding, the full proof answer would lie in
>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>
>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>> that will convert your joins automatically to map joins where
>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>> property.
>>>>
>>>> Mark
>>>>
>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>>> wrote:
>>>> > I have a silly question on how Hive interpretes a simple query with
>>>> both map
>>>> > side join and group by.
>>>> >
>>>> > Below query will translate into two jobs, with the 1st one as a map
>>>> only job
>>>> > doing the join and storing the output in a intermediary location, and
>>>> the
>>>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>>>> and
>>>> > doing the group by.
>>>> >
>>>> > SELECT
>>>> > /*+ MAPJOIN(d) */
>>>> > table.a, sum(table2.b)
>>>> > from table
>>>> > LEFT OUTER JOIN table2
>>>> > ON table.id = table2.id
>>>> > where hour = '2012-12-11 11'
>>>> > group by table.a
>>>> >
>>>> > Why can't this be done within a single map reduce job? As what I can
>>>> see
>>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>>> job's
>>>> > mapper output.
>>>> >
>>>> > --
>>>> > Chen Song
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song

Re: map side join with group by

Posted by Nitin Pawar <ni...@gmail.com>.

Thats because for the first job the join keys are different and second job
group by keys are different, you just cant assume join keys and group keys
will be same so they are two different jobs


On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <ch...@gmail.com> wrote:

> Yeah, my abridged version of query might be a little broken but my point
> is that when a query has a map join and group by, even in its simplified
> incarnation, it will launch two jobs. I was just wondering why map join and
> group by cannot be accomplished in one MR job.
>
> Best,
> Chen
>
>
> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> I think Chen wanted to know why this is two phased query if I understood
>> it correctly
>>
>> When you run a mapside join .. it just performs the join query .. after
>> that to execute the group by part it launches the second job.
>> I may be wrong but this is how I saw it whenever I executed group by
>> queries
>>
>>
>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <grover.markgrover@gmail.com
>> > wrote:
>>
>>> Hi Chen,
>>> I think we would need some more information.
>>>
>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>> there is not such table in the query. Moreover, Map joins only make
>>> sense when the right table is the one being "mapped" (in other words,
>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>> left table is the one being "mapped" in case of a Right Outer Join.
>>> Let me know if this is not clear, I'd be happy to offer a better
>>> explanation.
>>>
>>> In your query, the where clause on a column called "hour", at this
>>> point I am unsure if that's a column of table1 or table2. If it's
>>> column on table1, that predicate would get pushed up (if you have
>>> hive.optimize.ppd property set to true), so it could possibly be done
>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>> to check the explain plan). If however, the where clause is on a
>>> column in the right table (table2 in your example), it can't be pushed
>>> up since a column of the right table can have different values before
>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>> to be applied in a separate MR job.
>>>
>>> This is just my understanding, the full proof answer would lie in
>>> checking out the explain plans and the Semantic Analyzer code.
>>>
>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>> that will convert your joins automatically to map joins where
>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>> property.
>>>
>>> Mark
>>>
>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>>> wrote:
>>> > I have a silly question on how Hive interpretes a simple query with
>>> both map
>>> > side join and group by.
>>> >
>>> > Below query will translate into two jobs, with the 1st one as a map
>>> only job
>>> > doing the join and storing the output in a intermediary location, and
>>> the
>>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>>> and
>>> > doing the group by.
>>> >
>>> > SELECT
>>> > /*+ MAPJOIN(d) */
>>> > table.a, sum(table2.b)
>>> > from table
>>> > LEFT OUTER JOIN table2
>>> > ON table.id = table2.id
>>> > where hour = '2012-12-11 11'
>>> > group by table.a
>>> >
>>> > Why can't this be done within a single map reduce job? As what I can
>>> see
>>> > from the query plan is that all 2nd job mapper do is taking the 1st
>>> job's
>>> > mapper output.
>>> >
>>> > --
>>> > Chen Song
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Chen Song
>
>
>


-- 
Nitin Pawar

Re: map side join with group by

Posted by Chen Song <ch...@gmail.com>.

Yeah, my abridged version of query might be a little broken but my point is
that when a query has a map join and group by, even in its simplified
incarnation, it will launch two jobs. I was just wondering why map join and
group by cannot be accomplished in one MR job.

Best,
Chen

On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <ni...@gmail.com>wrote:

> I think Chen wanted to know why this is two phased query if I understood
> it correctly
>
> When you run a mapside join .. it just performs the join query .. after
> that to execute the group by part it launches the second job.
> I may be wrong but this is how I saw it whenever I executed group by
> queries
>
>
> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <gr...@gmail.com>wrote:
>
>> Hi Chen,
>> I think we would need some more information.
>>
>> The query is referring to a table called "d" in the MAPJOIN hint but
>> there is not such table in the query. Moreover, Map joins only make
>> sense when the right table is the one being "mapped" (in other words,
>> being kept in memory) in case of a Left Outer Join, similarly if the
>> left table is the one being "mapped" in case of a Right Outer Join.
>> Let me know if this is not clear, I'd be happy to offer a better
>> explanation.
>>
>> In your query, the where clause on a column called "hour", at this
>> point I am unsure if that's a column of table1 or table2. If it's
>> column on table1, that predicate would get pushed up (if you have
>> hive.optimize.ppd property set to true), so it could possibly be done
>> in 1 MR job (I am not sure if that's presently the case, you will have
>> to check the explain plan). If however, the where clause is on a
>> column in the right table (table2 in your example), it can't be pushed
>> up since a column of the right table can have different values before
>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>> to be applied in a separate MR job.
>>
>> This is just my understanding, the full proof answer would lie in
>> checking out the explain plans and the Semantic Analyzer code.
>>
>> And for completeness, there is a conditional task (starting Hive 0.7)
>> that will convert your joins automatically to map joins where
>> applicable. This can be enabled by enabling hive.auto.convert.join
>> property.
>>
>> Mark
>>
>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com>
>> wrote:
>> > I have a silly question on how Hive interpretes a simple query with
>> both map
>> > side join and group by.
>> >
>> > Below query will translate into two jobs, with the 1st one as a map
>> only job
>> > doing the join and storing the output in a intermediary location, and
>> the
>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>> and
>> > doing the group by.
>> >
>> > SELECT
>> > /*+ MAPJOIN(d) */
>> > table.a, sum(table2.b)
>> > from table
>> > LEFT OUTER JOIN table2
>> > ON table.id = table2.id
>> > where hour = '2012-12-11 11'
>> > group by table.a
>> >
>> > Why can't this be done within a single map reduce job? As what I can see
>> > from the query plan is that all 2nd job mapper do is taking the 1st
>> job's
>> > mapper output.
>> >
>> > --
>> > Chen Song
>> >
>> >
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song

Re: map side join with group by

Posted by Nitin Pawar <ni...@gmail.com>.

I think Chen wanted to know why this is two phased query if I understood it
correctly

When you run a mapside join .. it just performs the join query .. after
that to execute the group by part it launches the second job.
I may be wrong but this is how I saw it whenever I executed group by
queries


On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <gr...@gmail.com>wrote:

> Hi Chen,
> I think we would need some more information.
>
> The query is referring to a table called "d" in the MAPJOIN hint but
> there is not such table in the query. Moreover, Map joins only make
> sense when the right table is the one being "mapped" (in other words,
> being kept in memory) in case of a Left Outer Join, similarly if the
> left table is the one being "mapped" in case of a Right Outer Join.
> Let me know if this is not clear, I'd be happy to offer a better
> explanation.
>
> In your query, the where clause on a column called "hour", at this
> point I am unsure if that's a column of table1 or table2. If it's
> column on table1, that predicate would get pushed up (if you have
> hive.optimize.ppd property set to true), so it could possibly be done
> in 1 MR job (I am not sure if that's presently the case, you will have
> to check the explain plan). If however, the where clause is on a
> column in the right table (table2 in your example), it can't be pushed
> up since a column of the right table can have different values before
> and after the LEFT OUTER JOIN. Therefore, the where clause would need
> to be applied in a separate MR job.
>
> This is just my understanding, the full proof answer would lie in
> checking out the explain plans and the Semantic Analyzer code.
>
> And for completeness, there is a conditional task (starting Hive 0.7)
> that will convert your joins automatically to map joins where
> applicable. This can be enabled by enabling hive.auto.convert.join
> property.
>
> Mark
>
> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com> wrote:
> > I have a silly question on how Hive interpretes a simple query with both
> map
> > side join and group by.
> >
> > Below query will translate into two jobs, with the 1st one as a map only
> job
> > doing the join and storing the output in a intermediary location, and the
> > 2nd one as a map-reduce job taking the output of the 1st job as input and
> > doing the group by.
> >
> > SELECT
> > /*+ MAPJOIN(d) */
> > table.a, sum(table2.b)
> > from table
> > LEFT OUTER JOIN table2
> > ON table.id = table2.id
> > where hour = '2012-12-11 11'
> > group by table.a
> >
> > Why can't this be done within a single map reduce job? As what I can see
> > from the query plan is that all 2nd job mapper do is taking the 1st job's
> > mapper output.
> >
> > --
> > Chen Song
> >
> >
>



-- 
Nitin Pawar

Re: map side join with group by

Posted by Mark Grover <gr...@gmail.com>.

Hi Chen,
I think we would need some more information.

The query is referring to a table called "d" in the MAPJOIN hint but
there is not such table in the query. Moreover, Map joins only make
sense when the right table is the one being "mapped" (in other words,
being kept in memory) in case of a Left Outer Join, similarly if the
left table is the one being "mapped" in case of a Right Outer Join.
Let me know if this is not clear, I'd be happy to offer a better
explanation.

In your query, the where clause on a column called "hour", at this
point I am unsure if that's a column of table1 or table2. If it's
column on table1, that predicate would get pushed up (if you have
hive.optimize.ppd property set to true), so it could possibly be done
in 1 MR job (I am not sure if that's presently the case, you will have
to check the explain plan). If however, the where clause is on a
column in the right table (table2 in your example), it can't be pushed
up since a column of the right table can have different values before
and after the LEFT OUTER JOIN. Therefore, the where clause would need
to be applied in a separate MR job.

This is just my understanding, the full proof answer would lie in
checking out the explain plans and the Semantic Analyzer code.

And for completeness, there is a conditional task (starting Hive 0.7)
that will convert your joins automatically to map joins where
applicable. This can be enabled by enabling hive.auto.convert.join
property.

Mark

On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <ch...@gmail.com> wrote:
> I have a silly question on how Hive interpretes a simple query with both map
> side join and group by.
>
> Below query will translate into two jobs, with the 1st one as a map only job
> doing the join and storing the output in a intermediary location, and the
> 2nd one as a map-reduce job taking the output of the 1st job as input and
> doing the group by.
>
> SELECT
> /*+ MAPJOIN(d) */
> table.a, sum(table2.b)
> from table
> LEFT OUTER JOIN table2
> ON table.id = table2.id
> where hour = '2012-12-11 11'
> group by table.a
>
> Why can't this be done within a single map reduce job? As what I can see
> from the query plan is that all 2nd job mapper do is taking the 1st job's
> mapper output.
>
> --
> Chen Song
>
>