Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2010/02/24 19:45:30 UTC

[jira] Created: (HIVE-1194) sorted merge join

sorted merge join
-----------------

                 Key: HIVE-1194
                 URL: https://issues.apache.org/jira/browse/HIVE-1194
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: He Yongqiang
             Fix For: 0.6.0


If the input tables are sorted on the join key and a mapjoin is being performed, it is useful to exploit the sorted property of the tables.
This can lead to substantial CPU savings. It needs to work across bucketed map joins as well.

Since the sorted property of a table is not currently enforced, a new parameter can be added to request the sort-merge join.
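
For illustration only, here is a minimal sketch of the merge step that sorted inputs make possible. It is not the code from any patch on this issue. The class name SortMergeJoinSketch and the plain int keys are assumptions made for the example. Because both inputs are sorted ascending, each side is scanned once and no hash table is built.

```java
import java.util.ArrayList;
import java.util.List;

public class SortMergeJoinSketch {
    /** Inner-joins two key arrays that are already sorted ascending; returns matched key pairs. */
    public static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++; // left key has no partner yet, advance left
            } else if (left[i] > right[j]) {
                j++; // right key has no partner yet, advance right
            } else {
                // Equal keys: find the run of duplicates on each side.
                int key = left[i];
                int iEnd = i;
                int jEnd = j;
                while (iEnd < left.length && left[iEnd] == key) iEnd++;
                while (jEnd < right.length && right[jEnd] == key) jEnd++;
                // Emit the cross product of the two runs.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(new int[] {left[a], right[b]});
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] left = {1, 3, 4, 4, 10};   // illustrative keys, sorted
        int[] right = {4, 10, 10, 17};   // illustrative keys, sorted
        for (int[] pair : join(left, right)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

If either input is not actually sorted, this loop silently drops matches, which is one reason an explicit opt-in parameter is a reasonable safeguard while sortedness is not enforced.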


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840877#action_12840877 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Reviewed with Yongqiang online - 

MapJoinProcessor.java:convertMapJoin: Also check if the tables are sorted.
(check it later in SMBJoinOptimizer)

Add a negative test for the same.

Also, can you add a simple test with 2 buckets?



[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841550#action_12841550 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

+1

looks good - will commit if the tests pass

> sorted merge join
> -----------------
>
>                 Key: HIVE-1194
>                 URL: https://issues.apache.org/jira/browse/HIVE-1194
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>             Fix For: 0.6.0
>
>         Attachments: hive-1194-2010-02-28.patch, hive-1194-2010-3-2.2.patch, hive-1194-2010-3-2.patch, hive-1194-2010-3-3-2.patch, hive-1194-2010-3-3.patch, hive-1194-2010-3-4.patch
>
>
> If the input tables are sorted on the join key and a mapjoin is being performed, it is useful to exploit the sorted property of the tables.
> This can lead to substantial CPU savings. It needs to work across bucketed map joins as well.
> Since the sorted property of a table is not currently enforced, a new parameter can be added to request the sort-merge join.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840463#action_12840463 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

hive-1194-2010-3-2.2.patch fixed a bug in outer joins with more than 2 tables.


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-3-3-2.patch

A new patch that adds reportProgress.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841109#action_12841109 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Verified problem 2. above again - in the first query


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838191#action_12838191 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

Thanks Zheng. Yes, we should do that.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839932#action_12839932 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

Yes, we can do that. There are two problems to resolve:
(1) Serializing and deserializing the mapping. We generate the mapping at compile time, and the operator instance is different from the one at runtime.
(2) The fetchOperators need to be accessed in SMBMapJoinOperator, so they need to be passed from exec-mapper to SMBMapJoinOperator.

I just made a small change:
I added a new method initializeLocalWork() in Operator. In exec-mapper, the MapOperator's initializeLocalWork() is called, and it triggers initializeLocalWork() on all of its children.
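
The propagation described above can be sketched roughly as follows. This is an assumption-laden illustration and not the patch itself: the Op and SmbJoinOp classes, the initLocal hook, and the log list are hypothetical stand-ins, and only the method name initializeLocalWork() comes from the comment.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalWorkInitSketch {
    /** Hypothetical stand-in for an operator in the plan tree; not Hive's Operator class. */
    static class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }

        /** Attaches a child and returns this operator so calls can be chained. */
        Op add(Op child) {
            children.add(child);
            return this;
        }

        /** Sets up this operator's local work, then recurses into every child. */
        void initializeLocalWork(List<String> log) {
            initLocal(log);
            for (Op child : children) {
                child.initializeLocalWork(log);
            }
        }

        /** Default hook: most operators have no local work to set up. */
        void initLocal(List<String> log) {
        }
    }

    /** Hypothetical join operator that overrides the hook to open its fetchers. */
    static class SmbJoinOp extends Op {
        SmbJoinOp(String name) {
            super(name);
        }

        @Override
        void initLocal(List<String> log) {
            log.add(name + ": open fetch operators");
        }
    }

    /** Builds a three-level tree and triggers initialization from the root only. */
    static List<String> run() {
        Op root = new Op("MapOperator")
                .add(new Op("TableScan").add(new SmbJoinOp("SMBJoin")));
        List<String> log = new ArrayList<>();
        root.initializeLocalWork(log);
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With this shape the mapper only has to call the root once, and each operator decides for itself whether it has anything to initialize.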


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839922#action_12839922 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Had a quick comment - don't you need an operator -> FetchOperator mapping in MapredLocalWork?
Currently, you are implicitly assuming that mapjoins are the only operators doing so.


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-02-28.patch

for early review only. 
I will test it more and add more testcases.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840812#action_12840812 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

will take a look now.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840806#action_12840806 ] 

Namit Jain commented on HIVE-1194:
----------------------------------


PREHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key \
= b.key join smb_bucket_3 c on b.key=c.key
PREHOOK: type: QUERY
PREHOOK: Input: default@smb_bucket_2
PREHOOK: Input: default@smb_bucket_3
PREHOOK: Input: default@smb_bucket_1
PREHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03-\
02_16-29-05_320_5840475035790004401/10000
POSTHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key\
 = b.key join smb_bucket_3 c on b.key=c.key
POSTHOOK: type: QUERY
POSTHOOK: Input: default@smb_bucket_2
POSTHOOK: Input: default@smb_bucket_3
POSTHOOK: Input: default@smb_bucket_1
POSTHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03\
-02_16-29-05_320_5840475035790004401/10000


Why is this giving an empty result?


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840807#action_12840807 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

PREHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key \
= b.key right outer join smb_bucket_3 c on b.key=c.key
PREHOOK: type: QUERY
PREHOOK: Input: default@smb_bucket_2
PREHOOK: Input: default@smb_bucket_3
PREHOOK: Input: default@smb_bucket_1
PREHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03-\
02_16-29-16_626_5515675647620051128/10000
POSTHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key\
 = b.key right outer join smb_bucket_3 c on b.key=c.key
POSTHOOK: type: QUERY
POSTHOOK: Input: default@smb_bucket_2
POSTHOOK: Input: default@smb_bucket_3
POSTHOOK: Input: default@smb_bucket_1
POSTHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03\
-02_16-29-16_626_5515675647620051128/10000
NULL  NULL  NULL  NULL  4 val_4
NULL  NULL  NULL  NULL  10  val_10
NULL  NULL  NULL  NULL  17  val_17
NULL  NULL  NULL  NULL  19  val_19
NULL  NULL  NULL  NULL  20  val_20
NULL  NULL  NULL  NULL  23  val_23


Even this one looks wrong - can you take a look in detail?


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840805#action_12840805 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

POSTHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a left outer join smb_bucket_2 b on a.key \
= b.key full outer join smb_bucket_3 c on b.key=c.key
POSTHOOK: type: QUERY
POSTHOOK: Input: default@smb_bucket_2
POSTHOOK: Input: default@smb_bucket_3
POSTHOOK: Input: default@smb_bucket_1
POSTHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03\
-02_16-28-56_475_666795559542199348/10000
1 val_1 NULL  NULL  NULL  NULL
3 val_3 NULL  NULL  NULL  NULL
4 val_4 NULL  NULL  NULL  NULL
NULL  NULL  NULL  NULL  4 val_4
5 val_5 NULL  NULL  NULL  NULL
10  val_10  NULL  NULL  NULL  NULL
NULL  NULL  NULL  NULL  10  val_10
NULL  NULL  NULL  NULL  17  val_17
NULL  NULL  NULL  NULL  19  val_19
NULL  NULL  NULL  NULL  20  val_20
NULL  NULL  NULL  NULL  23  val_23


same as above


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841414#action_12841414 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

@namit,

The join results for key 498 are in the output:

496	val_496	496	val_496
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
498	val_498	498	val_498
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
5	val_5	5	val_5
9	val_9	9	val_9


I will add an automatic check query to the test and upload a new patch.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839935#action_12839935 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

There is an operator id which is unique, so the problem of different operator instances can be solved.

Each operator will access its local work. Currently, only map join operators will need them.
MapJoinOperator will get the complete small table in the beginning, whereas SMBJoinOperator reads it
row by row.

ExecMapper does nothing
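
To make the contrast between the two access patterns concrete, here is a small sketch under stated assumptions. The class SmallTableAccessSketch, the loadAll helper, and the SortedCursor type are invented for illustration and do not correspond to Hive classes. The first approach reads the whole small table up front. The second only advances through the sorted small table as far as the current probe key requires.

```java
import java.util.HashMap;
import java.util.Map;

public class SmallTableAccessSketch {
    /** Map-join style: read every small-table key up front into a hash map of counts. */
    static Map<Integer, Integer> loadAll(int[] keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    /** Sort-merge style: a cursor over sorted keys that only moves forward on demand. */
    static class SortedCursor {
        private final int[] keys;
        private int pos = 0;

        SortedCursor(int[] sortedKeys) {
            this.keys = sortedKeys;
        }

        /**
         * Skips rows with smaller keys, then consumes and counts rows equal to the key.
         * Probe keys must arrive in ascending order. A repeated probe key would need the
         * matched group to be buffered, which this sketch omits.
         */
        int matches(int key) {
            while (pos < keys.length && keys[pos] < key) {
                pos++;
            }
            int n = 0;
            while (pos < keys.length && keys[pos] == key) {
                pos++;
                n++;
            }
            return n;
        }

        /** Number of small-table rows consumed so far. */
        int rowsConsumed() {
            return pos;
        }
    }

    public static void main(String[] args) {
        int[] small = {4, 10, 17, 19, 20, 23}; // illustrative sorted keys
        System.out.println("hash entries loaded: " + loadAll(small).size());
        SortedCursor cursor = new SortedCursor(small);
        System.out.println("matches for 4: " + cursor.matches(4));
        System.out.println("rows consumed so far: " + cursor.rowsConsumed());
    }
}
```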


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-3-3.patch

A new patch integrates Namit and Siying's comments. Thanks Namit and Siying!


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840841#action_12840841 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

Checked with MySQL.
For the query:
select /*+mapjoin(a,b)*/ * from smb_bucket_1 a left outer join smb_bucket_2 b on a.key \
= b.key left outer join smb_bucket_3 c on b.key=c.key
the result is consistent.

I did not check the second query because MySQL does not support full outer join.
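
As a rough cross-check of the semantics, a tiny reference join can be written by hand. The sketch below is illustrative only. The class name ChainedLeftOuterJoinSketch is invented, the a and c keys mirror values visible in the quoted test output, and the b keys are made up, chosen so that they do not overlap a. It evaluates a LEFT OUTER JOIN b ON a.key = b.key LEFT OUTER JOIN c ON b.key = c.key with nested loops.

```java
import java.util.ArrayList;
import java.util.List;

public class ChainedLeftOuterJoinSketch {
    /** Returns rows {a, b, c}; null marks a side with no match. */
    static List<Integer[]> join(int[] a, int[] b, int[] c) {
        // First join: a LEFT OUTER JOIN b ON a.key = b.key
        List<Integer[]> ab = new ArrayList<>();
        for (int x : a) {
            boolean matched = false;
            for (int y : b) {
                if (x == y) {
                    ab.add(new Integer[] {x, y});
                    matched = true;
                }
            }
            if (!matched) {
                ab.add(new Integer[] {x, null});
            }
        }
        // Second join: (a, b) LEFT OUTER JOIN c ON b.key = c.key
        List<Integer[]> out = new ArrayList<>();
        for (Integer[] row : ab) {
            boolean matched = false;
            if (row[1] != null) { // a NULL b.key can never equal c.key
                for (int z : c) {
                    if (row[1] == z) {
                        out.add(new Integer[] {row[0], row[1], z});
                        matched = true;
                    }
                }
            }
            if (!matched) {
                out.add(new Integer[] {row[0], row[1], null});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 4, 5, 10};          // mirrors keys seen in the quoted output
        int[] b = {25, 30};                  // invented keys, disjoint from a
        int[] c = {4, 10, 17, 19, 20, 23};   // mirrors keys seen in the quoted output
        for (Integer[] row : join(a, b, c)) {
            System.out.println(row[0] + "\t" + row[1] + "\t" + row[2]);
        }
    }
}
```

Under these assumed inputs every output row carries NULL for both b and c, even though c contains 4 and 10, because the second join condition is on b.key and b.key is NULL for every row.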


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840802#action_12840802 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

smb_mapjoin4.q:


POSTHOOK: query: select /*+mapjoin(a,b)*/ * from smb_bucket_1 a left outer join smb_bucket_2 b on a.key \
= b.key left outer join smb_bucket_3 c on b.key=c.key
POSTHOOK: type: QUERY
POSTHOOK: Input: default@smb_bucket_2
POSTHOOK: Input: default@smb_bucket_3
POSTHOOK: Input: default@smb_bucket_1
POSTHOOK: Output: file:/Users/heyongqiang/Documents/workspace/Hive-Test/build/ql/scratchdir/hive_2010-03\
-02_16-28-42_346_3202067314016412424/10000
1 val_1 NULL  NULL  NULL  NULL
3 val_3 NULL  NULL  NULL  NULL
4 val_4 NULL  NULL  NULL  NULL
5 val_5 NULL  NULL  NULL  NULL
10  val_10  NULL  NULL  NULL  NULL


I am not sure if the above semantics are correct. This may be an existing bug in the code; can you check the semantics of MySQL and Oracle?


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838113#action_12838113 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Based on an offline discussion with Yongqiang, we were thinking of the following:


There will be a new mapping in MapredWork ->
Operator -> MapredLocalWork

This will be populated for SortMergeJoinOperator only.

SortMergeJoinOperator is a new operator which extends MapJoinOperator, and has the
same name as a MapJoinOperator.

MapJoinProcessor needs to create a SortMergeJoinOperator instead of a MapJoinOperator
when it sees the new configuration parameter.

MapJoinFactory methods need to change to create Operator->MapredLocalWork instead of
MapredLocalWork in MapredWork.
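
A rough sketch of this proposal, with everything marked as an assumption: MapJoinOp and SortMergeJoinOp are hypothetical stand-ins for the operators named above, the key "example.join.sortmerge" is an invented placeholder because the thread does not name the real parameter, and the "MAPJOIN" string is illustrative. It shows the subclass keeping the parent's name, the choice driven by configuration, and the per-operator local work map being populated only for the sort-merge variant.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class JoinOperatorChoiceSketch {
    /** Hypothetical stand-in for a map-join operator; not Hive's class. */
    static class MapJoinOp {
        String getName() {
            return "MAPJOIN"; // illustrative operator name
        }
    }

    /** Hypothetical sort-merge variant: extends the stand-in and keeps its name. */
    static class SortMergeJoinOp extends MapJoinOp {
    }

    /** Invented configuration key; the thread does not name the real parameter. */
    static final String SORT_MERGE_FLAG = "example.join.sortmerge";

    /** Operator -> local work (here just table aliases), filled only for sort-merge joins. */
    static final Map<MapJoinOp, List<String>> localWorkByOperator = new IdentityHashMap<>();

    /** Creates the operator variant selected by the configuration. */
    static MapJoinOp create(Map<String, String> conf, List<String> smallTableAliases) {
        boolean sortMerge = Boolean.parseBoolean(conf.getOrDefault(SORT_MERGE_FLAG, "false"));
        if (!sortMerge) {
            return new MapJoinOp();
        }
        MapJoinOp op = new SortMergeJoinOp();
        localWorkByOperator.put(op, smallTableAliases);
        return op;
    }

    public static void main(String[] args) {
        MapJoinOp op = create(Map.of(SORT_MERGE_FLAG, "true"), List.of("b"));
        System.out.println(op.getClass().getSimpleName() + " named " + op.getName());
    }
}
```

Keeping the same name means code that dispatches on the operator name can treat both variants alike, which appears to be the motivation for the inheritance.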


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838122#action_12838122 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

Yes. It does not need that storage.
The main reason for letting it extend MapJoinOperator is that we can then reuse the MapJoinOperator code for optimization and task generation.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841442#action_12841442 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

I know - the log file is correct, but when I run the tests, I get a diff.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840999#action_12840999 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Need to report progress for sort-merge join


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840825#action_12840825 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

@namit, are you looking at patch hive-1194-2010-3-2.2.patch?

For the last two queries you mentioned above,

select /*+ mapjoin(a,b) */ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key = b.key join smb_bucket_3 c on b.key=c.key

and

select /*+ mapjoin(a,b) */ * from smb_bucket_1 a right outer join smb_bucket_2 b on a.key = b.key right outer join smb_bucket_3 c on b.key=c.key


The results look good to me.
Results:

NULL	NULL	20	val_20	20	val_20
NULL	NULL	23	val_23	23	val_23

and

NULL	NULL	NULL	NULL	4	val_4
NULL	NULL	NULL	NULL	10	val_10
NULL	NULL	NULL	NULL	17	val_17
NULL	NULL	NULL	NULL	19	val_19
NULL	NULL	20	val_20	20	val_20
NULL	NULL	23	val_23	23	val_23


I will check Oracle and MySQL for the semantics of the first two queries you commented on.


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-3-2.2.patch


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840895#action_12840895 ] 

Siying Dong commented on HIVE-1194:
-----------------------------------

Yongqiang, can you add a test case in which the "big table" is generated from "select * from XXX where XXX", and make sure the 3-way join query works well?


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838130#action_12838130 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

A new optimization step will be created which will convert the mapjoin to a sortmergejoin


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838120#action_12838120 ] 

Zheng Shao commented on HIVE-1194:
----------------------------------

Why does SortMergeJoinOperator extend MapJoinOperator?
It seems to me that SortMergeJoinOperator does NOT need the in-memory/disk-backed HashMap that MapJoinOperator has, correct?
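
To illustrate why no hash table is needed, here is a minimal sketch of a sort-merge inner join over two inputs that are already sorted on the join key. It is a generic textbook version, not the code in the patch; the Row class, the innerJoin method and the sample values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortMergeJoinSketch {

    /** Hypothetical (key, value) row; inputs must be sorted by key. */
    static class Row {
        final int key;
        final String value;
        Row(int key, String value) { this.key = key; this.value = value; }
    }

    /** Inner join of two key-sorted lists using two cursors and no hash table. */
    static List<String> innerJoin(List<Row> left, List<Row> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i).key;
            int rk = right.get(j).key;
            if (lk < rk) {
                i++;                      // left key too small, advance left
            } else if (lk > rk) {
                j++;                      // right key too small, advance right
            } else {
                // Find the run of rows sharing this key on each side.
                int iEnd = i;
                while (iEnd < left.size() && left.get(iEnd).key == lk) iEnd++;
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd).key == lk) jEnd++;
                // Emit the cross product of the two runs.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(lk + "\t" + left.get(a).value + "\t" + right.get(b).value);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> left = Arrays.asList(new Row(20, "val_20"), new Row(23, "val_23"));
        List<Row> right = Arrays.asList(new Row(4, "val_4"), new Row(20, "val_20"), new Row(23, "val_23"));
        for (String line : innerJoin(left, right)) {
            System.out.println(line);
        }
    }
}
```

Only the rows for the current key need to be held at any time, which is where the savings over building a full hash table would come from.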



[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838132#action_12838132 ] 

Zheng Shao commented on HIVE-1194:
----------------------------------

If it does not inherit any methods, shall we add an AbstractMapJoinOperator as the common parent?
That AbstractMapJoinOperator can be converted to MapJoinOperator (or HashBasedMapJoinOperator, to be accurate) or SortMergeJoinOperator depending on the configuration/table properties.
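
A rough sketch of the suggested hierarchy is below. The class names AbstractMapJoinOperator, HashBasedMapJoinOperator and SortMergeJoinOperator come from this comment; the strategy and choose methods and their boolean parameters are hypothetical and only stand in for "the configuration/table properties".

```java
public class JoinOperatorHierarchySketch {

    /** Common parent holding whatever both map-side joins share. */
    static abstract class AbstractMapJoinOperator {
        abstract String strategy();
    }

    /** Hash-based map join: builds a hash table from the small tables. */
    static class HashBasedMapJoinOperator extends AbstractMapJoinOperator {
        @Override
        String strategy() { return "hash"; }
    }

    /** Sort-merge join: merges inputs that are already sorted on the join key. */
    static class SortMergeJoinOperator extends AbstractMapJoinOperator {
        @Override
        String strategy() { return "sort-merge"; }
    }

    /**
     * Hypothetical selection step: use sort-merge only when it is enabled by
     * configuration and the inputs are sorted on the join key.
     */
    static AbstractMapJoinOperator choose(boolean sortMergeEnabled, boolean sortedOnJoinKey) {
        if (sortMergeEnabled && sortedOnJoinKey) {
            return new SortMergeJoinOperator();
        }
        return new HashBasedMapJoinOperator();
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true).strategy());
        System.out.println(choose(true, false).strategy());
    }
}
```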



[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841099#action_12841099 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

There is a problem in smb_mapjoin_6.q: the checked-in results seem OK, but I am getting a diff.
Can you investigate?

There are 2 problems:

1. The order is not deterministic.
2. Bigger problem: 498 is missing from the results for the first query.


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-3-2.patch

A new patch that adds more test cases and fixes some bugs.
@namit,
I agree, that will make the code clearer. Can we do that in a follow-up JIRA? It requires a code refactoring which may break the existing mapjoin etc.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840829#action_12840829 ] 

He Yongqiang commented on HIVE-1194:
------------------------------------

BTW, I just checked the results without the map join hints. The results are consistent.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840969#action_12840969 ] 

Siying Dong commented on HIVE-1194:
-----------------------------------

It turns out that we also need to support a subquery for the "small table", like:

select /* mapjoin(t) */ from (select * from a where ...) t join ....


[jira] Resolved: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-1194.
------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed. Thanks Yongqiang

> sorted merge join
> -----------------
>
>                 Key: HIVE-1194
>                 URL: https://issues.apache.org/jira/browse/HIVE-1194
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>             Fix For: 0.6.0
>
>         Attachments: hive-1194-2010-02-28.patch, hive-1194-2010-3-2.2.patch, hive-1194-2010-3-2.patch, hive-1194-2010-3-3-2.patch, hive-1194-2010-3-3.patch, hive-1194-2010-3-4.patch
>
>
> If the input tables are sorted on the join key, and a mapjoin is being performed, it is useful to exploit the sorted properties of the table.
> This can lead to substantial cpu savings - this needs to work across bucketed map joins also.
> Since, sorted properties of a table are not enforced currently, a new parameter can be added to specify to use the sort-merge join.


[jira] Updated: (HIVE-1194) sorted merge join

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
-------------------------------

    Attachment: hive-1194-2010-3-4.patch

Attached a new patch.


[jira] Commented: (HIVE-1194) sorted merge join

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838121#action_12838121 ] 

Namit Jain commented on HIVE-1194:
----------------------------------

Yes, but it happens on the mapper. It is a special type of mapjoin.
It will end up overriding all the methods of map-join, but keeping it this way keeps the hierarchy correct.
