Posted to common-dev@hadoop.apache.org by "Edward Yoon (JIRA)" <ji...@apache.org> on 2007/10/10 10:05:50 UTC

[jira] Created: (HADOOP-2021) θ Join Condition

θ Join Condition
----------------

                 Key: HADOOP-2021
                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
             Project: Hadoop
          Issue Type: Sub-task
            Reporter: Edward Yoon




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

TableMap and TableReduce in hbase.

Posted by edward yoon <we...@udanax.org>.

I found a bug in TableMap and TableReduce.
The bug produces many duplicate column qualifiers.

------------------------------

B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org

[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543331 ] 

udanax edited comment on HADOOP-2021 at 11/17/07 7:16 PM:
---------------------------------------------------------------

{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

r3
      a    b    c   row    e   f  
=====================================
row1  a1   b1   c1  row1  e1  a1
row1  a1   b1   c1  row4  e4  a1
{code}

      was (Author: udanax):
    r1
       a     b    c
================
row1   a1    b1   c1
row2   a2    b2   c2

r2
       e     f
============
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1

{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
{code}

r3
      a    b    c   row    e   f  
=========================
row1  a1   b1   c1  row1  e1  a1
row1  a1   b1   c1  row4  e4  a1

  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
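For readers who want to see the difference the description refers to, here is a minimal sketch in plain Python (not the HBase shell and not the attached patch). It runs a nested-loop join and a sort-merge join over the r1/r2 rows from the comment above. The tuple layout and the function names are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Rows copied from the r1/r2 example in the comment above, as
# (row key, column values...). The tuple layout is an assumption.
r1 = [("row1", "a1", "b1", "c1"), ("row2", "a2", "b2", "c2")]
r2 = [("row1", "e1", "a1"), ("row2", "e2", "f2"),
      ("row3", "e3", "f3"), ("row4", "e4", "a1")]

A = itemgetter(1)  # position of column a in an r1 tuple
F = itemgetter(2)  # position of column f in an r2 tuple


def nested_loop_join(left, right):
    """Compare every left row with every right row."""
    return [l + r for l in left for r in right if A(l) == F(r)]


def grouped(rows, key):
    """Sort rows on the join column and group equal values together."""
    return [(k, list(g)) for k, g in groupby(sorted(rows, key=key), key=key)]


def sort_merge_join(left, right):
    """Sort both inputs on the join column, then merge them in one pass."""
    lg, rg = grouped(left, A), grouped(right, F)
    i = j = 0
    out = []
    while i < len(lg) and j < len(rg):
        (lk, lrows), (rk, rrows) = lg[i], rg[j]
        if lk < rk:
            i += 1          # left value has no partner yet, advance left
        elif lk > rk:
            j += 1          # right value has no partner yet, advance right
        else:
            # Equal join values: emit every pairing of the two groups.
            out.extend(l + r for l in lrows for r in rrows)
            i += 1
            j += 1
    return out


for row in sort_merge_join(r1, r2):
    print(row)
```

Both functions return the same two joined rows for this input. The nested loop compares all 2 x 4 pairs, while the merge step walks each sorted input once after sorting.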

[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547037 ] 

udanax edited comment on HADOOP-2021 at 11/30/07 12:18 AM:
----------------------------------------------------------------

- Added some comments.

I implemented a slightly different kind of join that runs two MapReduce passes.
I think we can discuss a parallel sort-merge join later.

      was (Author: udanax):
    -added some comments.

It was used little different way of join by mapreduce processing twice.
I think we discuss about parallel sort merge join later.
  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}
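The worked example in the description can be reproduced with a short sketch. This is plain Python with dicts standing in for tables. It is an illustration under that assumption, not the code in the attached patches, and the function names are made up here.

```python
from collections import defaultdict

# Tables from the description: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def build_temp_table(table, column):
    """Temp table T: join-column value -> row keys, sorted by that value."""
    groups = defaultdict(list)
    for row_key in sorted(table):
        groups[table[row_key][column]].append(row_key)
    return dict(sorted(groups.items()))


def join_with_temp(left, left_column, right, temp):
    """Probe T with each left row and key the output by 'left.right'."""
    result = {}
    for left_key in sorted(left):
        for right_key in temp.get(left[left_key][left_column], []):
            result[f"{left_key}.{right_key}"] = {
                "r1.row": left_key, **left[left_key],
                "r2.row": right_key, **right[right_key],
            }
    return result


T = build_temp_table(r2, "f")
r3 = join_with_temp(r1, "a", r2, T)
for key, columns in r3.items():
    print(key, columns)
```

The keys of `r3` come out as row1.row1, row1.row4, row2.row5, row3.row1 and row3.row4, matching the r3 table in the description.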

[jira] Commented: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

Edward Yoon commented on HADOOP-2021:
-------------------------------------

However, a row with a duplicate key cannot be inserted into an HBase table, so:

r3
            a  b  c  e  f
===========================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1

Hmm, is this all right in theory?

I need some advice.

[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Open  (was: Patch Available)

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.16.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Work started: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HADOOP-2021 started by Edward Yoon.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

This issue seems to be broken.
I am moving it to HADOOP-2328.


[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

udanax edited comment on HADOOP-2021 at 11/18/07 2:04 AM:
---------------------------------------------------------------

However, a row with a duplicate key cannot be inserted into an HBase table, so:

{code}
r3
            a  b  c  e  f
==============================
row1.row1   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Then, can I get the key set of the r1 table or the r2 table?

Hmm.
Also, is this all right in theory?

If you have any ideas, let me know.
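To make the question concrete, here is a small sketch of the key problem. A plain Python dict stands in for the result table, which is a simplification: it models only one value per row key and ignores cell versions. The helper names and the "." separator handling are assumptions for illustration.

```python
# Matches for r1.a = r2.f = a1, taken from the r3 table above:
# (r1 row key, r2 row key, joined columns).
matches = [
    ("row1", "row1", {"a": "a1", "b": "b1", "c": "c1", "e": "e1", "f": "a1"}),
    ("row1", "row4", {"a": "a1", "b": "b1", "c": "c1", "e": "e4", "f": "a1"}),
]


def store(matches, make_key):
    """Write each match under the key chosen by make_key (dict as table)."""
    table = {}
    for left_key, right_key, columns in matches:
        table[make_key(left_key, right_key)] = columns
    return table


def key_sets(table):
    """Recover the r1 and r2 row-key sets from composite 'left.right' keys.

    Assumes the r1 row keys contain no '.' character.
    """
    left, right = set(), set()
    for composite in table:
        left_key, right_key = composite.split(".", 1)
        left.add(left_key)
        right.add(right_key)
    return left, right


by_left_key = store(matches, lambda l, r: l)            # second write wins
by_composite = store(matches, lambda l, r: f"{l}.{r}")  # both rows survive

print(len(by_left_key), len(by_composite))
print(key_sets(by_composite))
```

With the r1 row key alone, the second match replaces the first. With the composite key both rows are kept, and the original key sets can still be recovered by splitting the key, as long as the separator does not occur in the r1 row keys.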

      was (Author: udanax):
    But, Can not insert duplicate key row in hbase table.
so........

{code}
r3
            a  b  c  e  f
==============================
row1.row1   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Then, Can i get a r1.row set or r2.row set??

hmm. 
Also, is it all right in theory ??

If you have any ideas, let me know.
  
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Due Date: 01/Dec/07
    Priority: Major  (was: Minor)
     Summary: [Hbase Shell] Sort Join Implementation  (was: Sort Join Implementation)

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Commented: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544187 ] 

Edward Yoon commented on HADOOP-2021:
-------------------------------------

The bug is reported in HADOOP-2244 and HADOOP-2234.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Patch Available  (was: Open)

submitting.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

udanax edited comment on HADOOP-2021 at 11/17/07 7:19 PM:
---------------------------------------------------------------

However, a row with a duplicate key cannot be inserted into an HBase table, so:

{code}
r3
            a  b  c  e  f
==============================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Hmm.
Is this all right in theory?

If you have any ideas, let me know.

      was (Author: udanax):
    But, Can not insert duplicate key row in hbase table.
so........

{code}
r3
            a  b  c  e  f
==============================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

hmm. 
is it all right in theory ??

If you have a idea, let me know.
  
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543537 ] 

udanax edited comment on HADOOP-2021 at 11/19/07 4:25 AM:
---------------------------------------------------------------

I found a bug in the TableMap and TableReduce classes.
The row iterator of the Map/Reduce function produces many duplicate column qualifiers.

      was (Author: udanax):
    I found a bug in TableMap, TableReduce classes.
Row iterator makes a lot of duplicated qualifier of column.
  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job 
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}
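The two steps listed in the description (an MR job that builds the sorted temp set, then a second pass that builds the new relation) can be simulated in one process. The sketch below is plain Python, not the Hadoop or HBase API. `run_mapreduce` and the mapper/reducer names are hypothetical stand-ins.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for one job: map, sort by key, reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # plays the role of the shuffle/sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


# Input rows from the description: (row key, column values...).
r1 = [("row1", "a1", "b1", "c1"), ("row2", "a2", "b2", "c2"),
      ("row3", "a1", "b3", "c3")]
r2 = [("row1", "e1", "a1"), ("row2", "e2", "f2"), ("row3", "e3", "f3"),
      ("row4", "e4", "a1"), ("row5", "e5", "a2")]


# Pass 1: build the temp set T, keyed and sorted by column f of r2.
def temp_mapper(record):
    row_key, _e, f = record
    yield f, row_key


def temp_reducer(f, row_keys):
    yield f, sorted(row_keys)


# Pass 2: probe T with column a of r1 and emit composite row keys.
def make_join_mapper(temp):
    def join_mapper(record):
        row_key, a = record[0], record[1]
        for right_key in temp.get(a, []):
            yield f"{row_key}.{right_key}", (row_key, right_key)
    return join_mapper


def join_reducer(key, values):
    yield key, values[0]  # composite keys are unique, so one value each


temp_rows = run_mapreduce(r2, temp_mapper, temp_reducer)
joined = run_mapreduce(r1, make_join_mapper(dict(temp_rows)), join_reducer)
for key, pair in joined:
    print(key, pair)
```

In this sketch the second pass reads T as an in-memory dict. A real job would have to read the temp file or table instead, and how the patch does that is not shown in this thread.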

[jira] Assigned: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon reassigned HADOOP-2021:
-----------------------------------

    Assignee: Edward Yoon

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

udanax edited comment on HADOOP-2021 at 11/17/07 7:17 PM:
---------------------------------------------------------------

However, a row with a duplicate key cannot be inserted into an HBase table, so:

{code}
r3
            a  b  c  e  f
==============================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Hmm.
Is this all right in theory?

If you have an idea, let me know.

      was (Author: udanax):
    But, Can not insert duplicate key row in hbase table.
so........

r3
               a  b  c  e  f
===========================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1

hmm. 
is it all right in theory ??

I need a any advice.
  
[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v05

I tested it successfully.

{code}
test:
     [echo] contrib: hbase
   [delete] Deleting directory /root/workspace/hadoop/build/contrib/hbase/test/logs
    [mkdir] Created dir: /root/workspace/hadoop/build/contrib/hbase/test/logs
    [junit] Running org.apache.hadoop.hbase.shell.TestHBaseShell
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 112.719 sec
    [junit] Running org.apache.hadoop.hbase.shell.TestSubstitutionVariables
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.017 sec
    [junit] Running org.apache.hadoop.hbase.shell.algebra.TestBooleanCondition
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.044 sec
    [junit] Running org.apache.hadoop.hbase.shell.algebra.TestBooleanTermFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 82.184 sec
    [junit] Running org.apache.hadoop.hbase.shell.algebra.TestJoinCondition
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.161 sec
    [junit] Running org.apache.hadoop.hbase.shell.algebra.TestSortJoinMapReduce
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 96.241 sec
    [junit] Running org.apache.hadoop.hbase.shell.algebra.TestTableJoinMapReduce
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 28.076 sec
    [junit] Running org.apache.hadoop.hbase.util.TestBase64
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.135 sec
    [junit] Running org.apache.hadoop.hbase.util.TestKeying
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.011 sec
    [junit] Running org.onelab.test.TestFilter
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.013 sec

BUILD SUCCESSFUL
Total time: 5 minutes 29 seconds
bash-3.00#
{code}

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}
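For readers following the example above, here is a minimal Python sketch of how temp table T and r3 could be derived from r1 and r2. It is an illustration only and is not taken from the attached patches: plain dicts stand in for HBase tables, and the helper names `build_temp_table` and `join` are hypothetical.

```python
from collections import defaultdict

# The example tables from the description, keyed by row key.
r1 = {"row1": {"a": "a1", "b": "b1", "c": "c1"},
      "row2": {"a": "a2", "b": "b2", "c": "c2"},
      "row3": {"a": "a1", "b": "b3", "c": "c3"}}
r2 = {"row1": {"e": "e1", "f": "a1"},
      "row2": {"e": "e2", "f": "f2"},
      "row3": {"e": "e3", "f": "f3"},
      "row4": {"e": "e4", "f": "a1"},
      "row5": {"e": "e5", "f": "a2"}}

def build_temp_table(table, column):
    """Group row keys by the value of `column`, sorted by that value (temp table T)."""
    groups = defaultdict(list)
    for row_key in sorted(table):
        groups[table[row_key][column]].append(row_key)
    return dict(sorted(groups.items()))

def join(left, left_col, right, right_col):
    """Look up each left row's join value in T and emit one row per match."""
    temp = build_temp_table(right, right_col)
    result = {}
    for left_key in sorted(left):
        for right_key in temp.get(left[left_key][left_col], []):
            # The composite key "<r1.row>.<r2.row>" keeps result row keys unique.
            result[left_key + "." + right_key] = {
                "r1.row": left_key, **left[left_key],
                "r2.row": right_key, **right[right_key],
            }
    return result

T = build_temp_table(r2, "f")
r3 = join(r1, "a", r2, "f")
for key, row in r3.items():
    print(key, row)
```

Only the row keys are kept in T here; the remaining r2 columns are fetched from r2 when the result row is built.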


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Summary: Sort Join Implementation  (was: θ Join Condition)

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>
>  * θ stands for the usual comparison operators '<', '<=', '>', '>=', '!=', '='
>  * Comparison terms in the Q clauses can be combined arbitrarily with the boolean operators AND, NOT, and OR
> {code}
> ex. (a.value = b.value and a.key != b.key)
> {code}
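As an illustration of the condition described above, the following Python sketch builds a join predicate from the six comparison operators and the AND, NOT, OR connectives. It does not reflect the HBase shell parser; the helper names (`term`, `and_`, `or_`, `not_`) and the sample rows are invented for the example.

```python
import operator

# The six comparison operators listed in the description.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "!=": operator.ne, "=": operator.eq}

def term(left_field, op, right_field):
    """Return a predicate comparing a[left_field] with b[right_field]."""
    compare = OPS[op]
    return lambda a, b: compare(a[left_field], b[right_field])

def and_(*preds):
    return lambda a, b: all(p(a, b) for p in preds)

def or_(*preds):
    return lambda a, b: any(p(a, b) for p in preds)

def not_(pred):
    return lambda a, b: not pred(a, b)

# ex. (a.value = b.value and a.key != b.key)
condition = and_(term("value", "=", "value"), term("key", "!=", "key"))

# Hypothetical sample rows, joined against themselves.
rows = [{"key": "k1", "value": 10},
        {"key": "k2", "value": 10},
        {"key": "k3", "value": 7}]
pairs = [(a["key"], b["key"]) for a in rows for b in rows if condition(a, b)]
print(pairs)  # [('k1', 'k2'), ('k2', 'k1')]
```

Because each term is an ordinary function of a row pair, arbitrary nesting of the connectives needs no extra machinery.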


[jira] Commented: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543362 ] 

Edward Yoon commented on HADOOP-2021:
-------------------------------------

On second thought, I changed the result format.
I decided to store the join result as described below.
Thanks, jimk and stack.

If you find any errors or inconsistencies, let me know.

{code}
r3
           r1.row  a  b  c   r2.row  e  f
=============================================
row1.row1  row1    a1 b1 c1  row1    e1 a1
row1.row4  row1    a1 b1 c1  row4    e4 a1
{code}

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}


[jira] Commented: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547395 ] 

Edward Yoon commented on HADOOP-2021:
-------------------------------------

Hudson does not seem to be running.
Does anyone know why?

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Patch Available  (was: Open)

Re-submitting after changing the Affects Version/s.

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.16.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Patch Available  (was: Open)

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

udanax edited comment on HADOOP-2021 at 11/17/07 7:32 PM:
---------------------------------------------------------------

However, a row with a duplicate key cannot be inserted into an HBase table,
so...

{code}
r3
            a  b  c  e  f
==============================
row1.row1   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Then, can I get the set of r1.row values or the set of r2.row values?

Hmm.
Also, is this correct in theory?

If you have any ideas, let me know.
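To make the question concrete, here is a small Python sketch of the composite-key idea. It is only an illustration: a dict stands in for a table with unique row keys, the "." separator is copied from the example keys above, and `make_key` and `split_key` are hypothetical helpers. Splitting the key back apart relies on the assumption that r1 row keys never contain the separator.

```python
SEP = "."  # separator taken from the example keys such as 'row1.row4'

def make_key(r1_row, r2_row):
    """Build a composite row key from the two source row keys."""
    return r1_row + SEP + r2_row

def split_key(key):
    """Recover (r1_row, r2_row); assumes r1 row keys never contain SEP."""
    r1_row, r2_row = key.split(SEP, 1)
    return r1_row, r2_row

# A dict models a table in which each row key can appear only once.
r3 = {}
r3[make_key("row1", "row1")] = {"a": "a1", "b": "b1", "c": "c1", "e": "e1", "f": "a1"}
r3[make_key("row1", "row4")] = {"a": "a1", "b": "b1", "c": "c1", "e": "e4", "f": "a1"}

# The distinct source row keys can be read back from the composite keys.
r1_rows = sorted({split_key(k)[0] for k in r3})
r2_rows = sorted({split_key(k)[1] for k in r3})
print(r1_rows)  # ['row1']
print(r2_rows)  # ['row1', 'row4']
```

If row keys may contain the separator, storing r1.row and r2.row as explicit columns avoids the ambiguity.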

      was (Author: udanax):
    However, a row with a duplicate key cannot be inserted into an HBase table,
so...

{code}
r3
            a  b  c  e  f
==============================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Then, can I get the set of r1.row values or the set of r2.row values?

Hmm.
Also, is this correct in theory?

If you have any ideas, let me know.
  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. Create a sorted-set temp file for the sort join on HDFS, using a MapReduce job.
2. Create a new relation table in HBase.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}


  was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. Create a sorted-set temp file for the sort join, using a MapReduce job.
2. Create a new relation table.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}



> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. Create a sorted-set temp file for the sort join on HDFS, using a MapReduce job.
> 2. Create a new relation table in HBase.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
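To show why sorting helps when no index is available, the Python sketch below compares a nested-loop join with a sort-merge join on in-memory lists. It is a generic textbook-style illustration, not the MapReduce implementation in the patches; the sample rows and the `title` and `ticker` columns are invented, and only `studioName` and `corporation` come from the description.

```python
def nested_loop_join(left, right, left_col, right_col):
    """Compare every left row with every right row: O(n * m) comparisons."""
    return [(l, r) for l in left for r in right if l[left_col] == r[right_col]]

def sort_merge_join(left, right, left_col, right_col):
    """Sort both inputs on the join column, then merge them in a single pass."""
    left = sorted(left, key=lambda row: row[left_col])
    right = sorted(right, key=lambda row: row[right_col])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][left_col], right[j][right_col]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of right rows that share this join value.
            j_end = j
            while j_end < len(right) and right[j_end][right_col] == lv:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][left_col] == lv:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

# Hypothetical sample data for the two tables named in the description.
movie_log = [{"title": "m1", "studioName": "s2"},
             {"title": "m2", "studioName": "s1"},
             {"title": "m3", "studioName": "s2"}]
stock_company = [{"corporation": "s1", "ticker": "AAA"},
                 {"corporation": "s2", "ticker": "BBB"},
                 {"corporation": "s3", "ticker": "CCC"}]

merged = sort_merge_join(movie_log, stock_company, "studioName", "corporation")
nested = nested_loop_join(movie_log, stock_company, "studioName", "corporation")
print(sorted((l["title"], r["ticker"]) for l, r in merged))
```

Both functions return the same pairs; the sort-merge version pays for two sorts up front and then avoids rescanning the right input for every left row.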


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Issue Type: Improvement  (was: Sub-task)
        Parent:     (was: HADOOP-1608)

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Affects Version/s:     (was: 0.14.1)
                       0.15.0
               Status: Patch Available  (was: Open)

re-submitting.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v04.patch

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Open  (was: Patch Available)

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}


  was:If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.


Updated the description with an example.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}


[jira] Updated: (HADOOP-2021) θ Join Condition

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

          Component/s: contrib/hbase
        Fix Version/s: 0.16.0
             Priority: Minor  (was: Major)
          Description: 
 * θ stands for the usual comparison operators '<', '<=', '>', '>=', '!=', '='
 * Comparison terms in the Q clauses can be combined arbitrarily with the boolean operators AND, NOT, and OR


{code}
ex. (a.value = b.value and a.key != b.key)
{code}
          Environment: all environments  
    Affects Version/s: 0.14.1

> θ Join Condition
> ----------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>
>  * θ stands for the usual comparison operators '<', '<=', '>', '>=', '!=', '='
>  * Comparison terms in the Q clauses can be combined arbitrarily with the boolean operators AND, NOT, and OR
> {code}
> ex. (a.value = b.value and a.key != b.key)
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Patch Available  (was: Open)

re-submitting.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ] 

udanax edited comment on HADOOP-2021 at 11/17/07 7:24 PM:
---------------------------------------------------------------

But a row with a duplicate key cannot be inserted into an HBase table,
so the result could be keyed like this:

{code}
r3
            a  b  c  e  f
==============================
row1.row1   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

Then, can I still get the set of r1 rows or the set of r2 rows from the result?

Hmm.
Also, is this all right in theory?

If you have any ideas, let me know.
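
To make the row-key problem concrete, here is a small Python sketch. A dict stands in for a table keyed by row (an assumption of this sketch, not HBase code): writing the join output under the r1 row key alone lets the second match replace the first, while a composite `r1row.r2row` key keeps one entry per matching pair.

```python
# The two matches for r1 row1 in the example: (r1 row key, r2 row key, r2 cells).
matches = [
    ("row1", "row1", {"e": "e1", "f": "a1"}),
    ("row1", "row4", {"e": "e4", "f": "a1"}),
]

by_left_key = {}
by_composite_key = {}
for left_key, right_key, cells in matches:
    # Same key twice: the second write replaces the first.
    by_left_key[left_key] = cells
    # Composite key: one entry per (r1 row, r2 row) pair.
    by_composite_key[f"{left_key}.{right_key}"] = cells

print(by_left_key)
print(by_composite_key)
```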

      was (Author: udanax):
    But, Can not insert duplicate key row in hbase table.
so........

{code}
r3
            a  b  c  e  f
==============================
row1.row4   a1 b1 c1 e1 a1
row1.row4   a1 b1 c1 e4 a1
{code}

hmm. 
is it all right in theory ??

If you have any ideas, let me know.
  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.  (was:  * stands for the usual comparison operators '<','<=', '>', '>=',  '!=', '='
 * comparing terms in the Q clauses can be arbitrarily connected with boolean operators AND, NOT, OR


{code}
ex. (a.value = b.value and a.key != b.key)
{code})

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v01.patch

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment:     (was: patch.txt)

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       
=============
a1  row:row1  
    row:row4  
a2  row:row5 
f2  row:row2 
f3  row:row3 
---------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}





  was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}






> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Affects Version/s:     (was: 0.15.0)
                       0.15.1
               Status: Open  (was: Patch Available)

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: patch.txt

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, patch.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Patch Available  (was: In Progress)

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Affects Version/s:     (was: 0.15.1)
                       0.16.0

Ah, changing the Affects Version/s.

> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.16.0
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543362 ] 

udanax edited comment on HADOOP-2021 at 11/18/07 2:02 AM:
---------------------------------------------------------------

On second thought, I changed the result format.
I decided to store the join result as described below.
Thanks, jimk and stack.

If you find any errors or inconsistencies, let me know.

{code}
r3
           r1.row  a  b  c   r2.row  e  f
=============================================
row1.row1  row1    a1 b1 c1  row1    e1 a1
row1.row4  row1    a1 b1 c1  row4    e4 a1
{code}
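
With the source row keys kept as ordinary columns, the set of r1 rows or r2 rows that took part in the join is a simple projection. The Python sketch below uses a dict in place of the r3 table, with values taken from the issue description's example; the helper name `source_rows` is hypothetical.

```python
# r3 as described: composite row key, plus r1.row and r2.row stored as columns.
r3 = {
    "row1.row1": {"r1.row": "row1", "a": "a1", "b": "b1", "c": "c1",
                  "r2.row": "row1", "e": "e1", "f": "a1"},
    "row1.row4": {"r1.row": "row1", "a": "a1", "b": "b1", "c": "c1",
                  "r2.row": "row4", "e": "e4", "f": "a1"},
}


def source_rows(result, column):
    """Distinct source row keys recorded in the given column, sorted."""
    return sorted({cells[column] for cells in result.values()})


print(source_rows(r3, "r1.row"))
print(source_rows(r3, "r2.row"))
```

Splitting the composite key on '.' would also work for this data, but it becomes ambiguous as soon as a row key itself contains a '.', which the explicit columns avoid.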

      was (Author: udanax):
    On second thought, I changed the result format.
And I had decision to store the join result as describe below.
Thanks, jimk and stack.

but, if you find a errors or inconsistencies, let me know.

{code}
r3
           r1.row  a  b  c   r2.row  e  f
=============================================
row1.row1  row1    a1 b1 c1  row1    e1 a1
row1.row4  row1    a1 b1 c1  row2    e4 a1
{code}
  
> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Open  (was: Patch Available)

Canceling; the patch does not seem to have been registered.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       
> =============
> a1  row:row1  
>     row:row4  
> a2  row:row5 
> f2  row:row2 
> f3  row:row3 
> ---------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}





  was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. make a sorted set temp file for sort join using MR job 
2. make a new Relation table on hbase

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}






> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Status: Open  (was: Patch Available)

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}

[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v06.patch

- Added some comments.

This version performs the join a little differently: it runs MapReduce twice.
I think we can discuss a parallel sort-merge join later.
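
For that later discussion, here is a minimal single-machine sketch of a classic sort-merge join in Python. It is not what the attached patch does, and it is not parallel. The function name and the list-of-dicts row layout are assumptions made only for illustration.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Join two lists of dict rows where left[left_key] == right[right_key]."""
    # Sort both inputs on their join column (sorted() is stable).
    left_sorted = sorted(left, key=lambda row: row[left_key])
    right_sorted = sorted(right, key=lambda row: row[right_key])
    out = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        lv = left_sorted[i][left_key]
        rv = right_sorted[j][right_key]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of equal values on each side, then pair them up.
            i_end = i
            while i_end < len(left_sorted) and left_sorted[i_end][left_key] == lv:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][right_key] == lv:
                j_end += 1
            for left_row in left_sorted[i:i_end]:
                for right_row in right_sorted[j:j_end]:
                    out.append({**left_row, **right_row})
            i, j = i_end, j_end
    return out


# Sample rows taken from the issue description (only the join columns).
r1_rows = [
    {"r1.row": "row1", "a": "a1"},
    {"r1.row": "row2", "a": "a2"},
    {"r1.row": "row3", "a": "a1"},
]
r2_rows = [
    {"r2.row": "row1", "f": "a1"},
    {"r2.row": "row2", "f": "f2"},
    {"r2.row": "row4", "f": "a1"},
    {"r2.row": "row5", "f": "a2"},
]
joined = sort_merge_join(r1_rows, r2_rows, "a", "f")
pairs = [(row["r1.row"], row["r2.row"]) for row in joined]
```

Each input is scanned once after sorting, so the cost is dominated by the two sorts rather than by comparing every pair of rows.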

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Commented: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543331 ] 

Edward Yoon commented on HADOOP-2021:
-------------------------------------

r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1

{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
{code}

r3
      a    b    c   row    e   f  
=====================================
row1  a1   b1   c1  row1  e1  a1
row1  a1   b1   c1  row4  e4  a1
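
As a baseline for comparison, the nested-loop join that the sort join is meant to improve on can be sketched in Python as follows. The dict-of-dicts table layout and the function name are assumptions for illustration only; they are not part of the patch.

```python
# The two sample tables above: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
}


def nested_loop_join(left, left_column, right, right_column):
    """Compare every left row with every right row and keep the matches."""
    matches = []
    comparisons = 0
    for left_key, left_row in left.items():
        for right_key, right_row in right.items():
            comparisons += 1
            if left_row[left_column] == right_row[right_column]:
                matches.append((left_key, right_key, {**left_row, **right_row}))
    return matches, comparisons


r3, comparisons = nested_loop_join(r1, "a", r2, "f")
```

With 2 rows in r1 and 4 rows in r2 this makes 8 comparisons to find the 2 matching pairs; the number of comparisons grows with the product of the table sizes.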


> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. make a sorted set temp file for sort join using MR job 
2. make a new Relation table on hbase

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}
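
The two steps can be sketched on a single machine in Python. This is only an illustration under assumptions: plain dicts stand in for the tables, the function names are hypothetical, and a dict lookup stands in for a seek into the sorted temp table. The real implementation runs as MR jobs.

```python
# Sample tables from the description: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def build_temp_table(table, key_column):
    """Step 1: group row keys by join-column value, ordered by that value."""
    groups = {}
    for row_key in sorted(table):
        value = table[row_key][key_column]
        groups.setdefault(value, []).append(row_key)
    return dict(sorted(groups.items()))


def sort_join(left, left_column, right, right_column):
    """Step 2: scan the left table and emit one result row per match in T."""
    temp = build_temp_table(right, right_column)
    result = {}
    for left_key in sorted(left):
        for right_key in temp.get(left[left_key][left_column], []):
            result[f"{left_key}.{right_key}"] = {
                "r1.row": left_key,
                **left[left_key],
                "r2.row": right_key,
                **right[right_key],
            }
    return temp, result


temp_t, r3 = sort_join(r1, "a", r2, "f")
```

Here `temp_t` corresponds to temp table T (for example, key `a1` maps to `row1` and `row4`), and the keys of `r3` are the composite row keys shown in the r3 table.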





  was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. make a sorted set temp file for sort join using MR job on hdfs
2. make a new Relation table on hbase

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}



Updated the description.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job 
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. make a sorted set temp file for sort join using MR job
2. make a new Relation table

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}


  was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}



Updated the description.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job
> 2. make a new Relation table
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v02.txt

I found a bug in the TableMap and TableReduce classes.
The row iterator produces many duplicate column qualifiers.

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job 
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}


[jira] Updated: (HADOOP-2021) Sort Join Implementation

Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Yoon updated HADOOP-2021:
--------------------------------

    Attachment: 2021_v05.patch

{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f and r1.a = 'a2') and r2;
save r3 into table('result');
{code}
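
To show what this query should return on the sample tables from the issue description, here is a rough single-machine sketch in Python. The function name, the dict-based tables, and the `tables` dict standing in for `save ... into table('result')` are all assumptions for illustration; the patch itself may evaluate the extra condition at a different point.

```python
# Sample tables from the issue description: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def join_with_filter(left, left_column, right, right_column, left_filter):
    """Equi-join left and right, keeping only left rows that pass left_filter."""
    # Apply the selection first so fewer rows reach the join.
    selected = {key: row for key, row in left.items() if left_filter(row)}
    # Group right row keys by join-column value.
    by_value = {}
    for right_key in sorted(right):
        by_value.setdefault(right[right_key][right_column], []).append(right_key)
    result = {}
    for left_key in sorted(selected):
        for right_key in by_value.get(selected[left_key][left_column], []):
            result[f"{left_key}.{right_key}"] = {**selected[left_key], **right[right_key]}
    return result


# Stand-in for: save r3 into table('result');
tables = {}
tables["result"] = join_with_filter(r1, "a", r2, "f", lambda row: row["a"] == "a2")
```

Only r1's row2 has a = 'a2', and only r2's row5 has f = 'a2', so the saved table holds the single row `row2.row5`.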

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Assignee: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>        a     b    c
> ======================
> row1   a1    b1   c1
> row2   a2    b2   c2
> row3   a1    b3   c3
> r2
>        e     f
> ==================
> row1   e1    a1
> row2   e2    f2
> row3   e3    f3
> row4   e4    a1
> row5   e5    a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
>     row       e     f
> ===========================
> a1  row:row1  e1    a1
>     row:row4  e4    
> a2  row:row5  e5    a2
> f2  row:row2  e2    f2
> f3  row:row3  e3    f3
> ---------------------------------------------
> r3
>            r1.row   a    b    c   r2.row    e   f  
> ===================================================
> row1.row1  row1     a1   b1   c1  row1      e1  a1
> row1.row4  row1     a1   b1   c1  row4      e4  a1
> row2.row5  row2     a2   b2   c2  row5      e5  a2
> row3.row1  row3     a1   b3   c3  row1      e1  a1
> row3.row4  row3     a1   b3   c3  row4      e4  a1
> {code}
