Posted to common-dev@hadoop.apache.org by "Edward Yoon (JIRA)" <ji...@apache.org> on 2007/10/10 10:05:50 UTC
[jira] Created: (HADOOP-2021) θ Join Condition
θ Join Condition
----------------
Key: HADOOP-2021
URL: https://issues.apache.org/jira/browse/HADOOP-2021
Project: Hadoop
Issue Type: Sub-task
Reporter: Edward Yoon
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
TableMap and TableReduce in hbase.
Posted by edward yoon <we...@udanax.org>.
I found a bug in TableMap and TableReduce.
The bug produces many duplicated column qualifiers.
------------------------------
B. Regards,
Edward yoon @ NHN, corp.
Home : http://www.udanax.org
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543331 ]
udanax edited comment on HADOOP-2021 at 11/17/07 7:16 PM:
---------------------------------------------------------------
{code}
r1
a b c
======================
row1 a1 b1 c1
row2 a2 b2 c2
r2
e f
==================
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
r3
a b c row e f
=====================================
row1 a1 b1 c1 row1 e1 a1
row1 a1 b1 c1 row4 e4 a1
{code}
was (Author: udanax):
r1
a b c
================
row1 a1 b1 c1
row2 a2 b2 c2
r2
e f
============
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
{code}
r3
a b c row e f
=========================
row1 a1 b1 c1 row1 e1 a1
row1 a1 b1 c1 row4 e4 a1
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
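For comparison, the nested-loop join that the description wants to improve on can be sketched as follows. This is only an illustrative Python model, not the HBase Shell implementation: the dictionaries mirror the two-row r1 and four-row r2 tables from the comment above, and `nested_loop_join` is a hypothetical helper name.

```python
# Tables copied from the r1/r2 example in the comment: row key -> columns.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
}


def nested_loop_join(left, right, left_col, right_col):
    """Compare every left row with every right row (hypothetical helper)."""
    comparisons = 0
    pairs = []
    for lkey, lrow in left.items():
        for rkey, rrow in right.items():
            comparisons += 1
            if lrow[left_col] == rrow[right_col]:
                pairs.append((lkey, rkey))
    return pairs, comparisons


pairs, comparisons = nested_loop_join(r1, r2, "a", "f")
print(pairs)        # [('row1', 'row1'), ('row1', 'row4')]
print(comparisons)  # 8, i.e. len(r1) * len(r2)
```

The comparison count grows as the product of the two table sizes, which is the cost a sort-based join tries to avoid when no index exists on the join column.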
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547037 ]
udanax edited comment on HADOOP-2021 at 11/30/07 12:18 AM:
----------------------------------------------------------------
-added some comments.
I implemented the join in a somewhat different way, using two MapReduce passes.
I think we can discuss a parallel sort-merge join later.
was (Author: udanax):
-added some comments.
It uses a slightly different way of joining, with two MapReduce passes.
I think we can discuss a parallel sort-merge join later.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
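The worked example in the description can be reproduced with a small sort-merge join. The following is a minimal Python sketch under stated assumptions: the tables are modelled as plain dictionaries, `sort_merge_join` is a hypothetical name, and nothing here reflects the code in the attached patches.

```python
from itertools import groupby

r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def grouped_by(table, col):
    """Sort rows by the join column, then group rows that share a value."""
    ordered = sorted(table.items(), key=lambda kv: (kv[1][col], kv[0]))
    return [(v, list(g)) for v, g in groupby(ordered, key=lambda kv: kv[1][col])]


def sort_merge_join(left, right, left_col, right_col):
    """Merge two sorted group lists; the result key is '<r1.row>.<r2.row>'."""
    lgroups, rgroups = grouped_by(left, left_col), grouped_by(right, right_col)
    result, i, j = {}, 0, 0
    while i < len(lgroups) and j < len(rgroups):
        lval, lrows = lgroups[i]
        rval, rrows = rgroups[j]
        if lval < rval:
            i += 1
        elif lval > rval:
            j += 1
        else:
            # Equal join values: emit every pairing of the two groups.
            for lkey, lrow in lrows:
                for rkey, rrow in rrows:
                    result[f"{lkey}.{rkey}"] = {
                        "r1.row": lkey, **lrow, "r2.row": rkey, **rrow,
                    }
            i += 1
            j += 1
    return result


r3 = sort_merge_join(r1, r2, "a", "f")
for key in sorted(r3):
    print(key, r3[key])
```

Sorted by key, the five result rows are row1.row1, row1.row4, row2.row5, row3.row1 and row3.row4, the same keys as the r3 table in the description.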
[jira] Commented: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
Edward Yoon commented on HADOOP-2021:
-------------------------------------
But a row with a duplicate key cannot be inserted into an HBase table.
So:
r3
a b c e f
===========================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
Hmm.
Is this correct in theory?
I would appreciate any advice.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Open (was: Patch Available)
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.16.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Work started: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HADOOP-2021 started by Edward Yoon.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Resolution: Duplicate
Status: Resolved (was: Patch Available)
This issue seems to be broken.
I am moving it to HADOOP-2328.
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.16.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
udanax edited comment on HADOOP-2021 at 11/18/07 2:04 AM:
---------------------------------------------------------------
But a row with a duplicate key cannot be inserted into an HBase table.
So:
{code}
r3
a b c e f
==============================
row1.row1 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Then, can I get the key set of the r1 table or the r2 table?
Hmm.
Also, is this correct in theory?
If you have any ideas, let me know.
was (Author: udanax):
But a row with a duplicate key cannot be inserted into an HBase table.
So:
{code}
r3
a b c e f
==============================
row1.row1 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Then, can I get an r1.row set or an r2.row set?
Hmm.
Also, is this correct in theory?
If you have any ideas, let me know.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Due Date: 01/Dec/07
Priority: Major (was: Minor)
Summary: [Hbase Shell] Sort Join Implementation (was: Sort Join Implementation)
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Commented: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544187 ]
Edward Yoon commented on HADOOP-2021:
-------------------------------------
The bug is reported in HADOOP-2244 and HADOOP-2234.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Patch Available (was: Open)
submitting.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
udanax edited comment on HADOOP-2021 at 11/17/07 7:19 PM:
---------------------------------------------------------------
But a row with a duplicate key cannot be inserted into an HBase table.
So:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Hmm.
Is this correct in theory?
If you have any ideas, let me know.
was (Author: udanax):
But a row with a duplicate key cannot be inserted into an HBase table.
So:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Hmm.
Is this correct in theory?
If you have an idea, let me know.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543537 ]
udanax edited comment on HADOOP-2021 at 11/19/07 4:25 AM:
---------------------------------------------------------------
I found a bug in the TableMap and TableReduce classes.
The row iterator of the Map/Reduce functions produces many duplicated column qualifiers.
was (Author: udanax):
I found a bug in the TableMap and TableReduce classes.
The row iterator produces many duplicated column qualifiers.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
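Step 1 of the description (building a sorted set for the join with an MR job) can be mimicked in a few lines. This sketch only imitates the map and reduce phases in plain Python; it does not use the Hadoop or HBase APIs, and `map_phase` and `reduce_phase` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def map_phase(table, join_col):
    """Emit (join value, row key) per row, like a mapper keyed on the join column."""
    for row_key, columns in table.items():
        yield columns[join_col], row_key


def reduce_phase(pairs):
    """Group row keys by join value and return the groups in sorted key order."""
    groups = defaultdict(list)
    for value, row_key in pairs:
        groups[value].append(row_key)
    return [(value, sorted(groups[value])) for value in sorted(groups)]


temp_t = reduce_phase(map_phase(r2, "f"))
for value, rows in temp_t:
    print(value, rows)
# a1 ['row1', 'row4']
# a2 ['row5']
# f2 ['row2']
# f3 ['row3']
```

The printed groups correspond to the rows of temp table T in the example: each distinct value of "f" becomes a key, and the r2 row keys that carry it are listed under that key.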
[jira] Assigned: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon reassigned HADOOP-2021:
-----------------------------------
Assignee: Edward Yoon
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
udanax edited comment on HADOOP-2021 at 11/17/07 7:17 PM:
---------------------------------------------------------------
But a row with a duplicate key cannot be inserted into an HBase table.
So:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Hmm.
Is this correct in theory?
If you have an idea, let me know.
was (Author: udanax):
But a row with a duplicate key cannot be inserted into an HBase table.
So:
r3
a b c e f
===========================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
Hmm.
Is this correct in theory?
I would appreciate any advice.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v05
I tested it successfully.
{code}
test:
[echo] contrib: hbase
[delete] Deleting directory /root/workspace/hadoop/build/contrib/hbase/test/logs
[mkdir] Created dir: /root/workspace/hadoop/build/contrib/hbase/test/logs
[junit] Running org.apache.hadoop.hbase.shell.TestHBaseShell
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 112.719 sec
[junit] Running org.apache.hadoop.hbase.shell.TestSubstitutionVariables
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.017 sec
[junit] Running org.apache.hadoop.hbase.shell.algebra.TestBooleanCondition
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.044 sec
[junit] Running org.apache.hadoop.hbase.shell.algebra.TestBooleanTermFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 82.184 sec
[junit] Running org.apache.hadoop.hbase.shell.algebra.TestJoinCondition
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.161 sec
[junit] Running org.apache.hadoop.hbase.shell.algebra.TestSortJoinMapReduce
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 96.241 sec
[junit] Running org.apache.hadoop.hbase.shell.algebra.TestTableJoinMapReduce
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 28.076 sec
[junit] Running org.apache.hadoop.hbase.util.TestBase64
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.135 sec
[junit] Running org.apache.hadoop.hbase.util.TestKeying
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.011 sec
[junit] Running org.onelab.test.TestFilter
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.013 sec
BUILD SUCCESSFUL
Total time: 5 minutes 29 seconds
bash-3.00#
{code}
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Summary: Sort Join Implementation (was: θ Join Condition)
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
>
> * stands for the usual comparison operators '<','<=', '>', '>=', '!=', '='
> * comparing terms in the Q clauses can be arbitrarily connected with boolean operators AND, NOT, OR
> {code}
> ex. (a.value = b.value and a.key != b.key)
> {code}
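A θ-join condition of this kind can be modelled as a small tree of predicates. The sketch below is a hedged Python illustration rather than the shell's parser: `term`, `and_`, `or_` and `not_` are hypothetical helpers, and the three sample rows are made up purely to exercise the example condition.

```python
import operator

# The comparison operators listed in the description.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "!=": operator.ne, "=": operator.eq,
}


def term(left_col, op, right_col):
    """Build a predicate comparing a[left_col] with b[right_col] using op."""
    compare = OPS[op]
    return lambda a, b: compare(a[left_col], b[right_col])


def and_(*preds):
    return lambda a, b: all(p(a, b) for p in preds)


def or_(*preds):
    return lambda a, b: any(p(a, b) for p in preds)


def not_(pred):
    return lambda a, b: not pred(a, b)


# ex. (a.value = b.value and a.key != b.key)
condition = and_(term("value", "=", "value"), term("key", "!=", "key"))

# Hypothetical sample rows, used only for illustration.
rows = [
    {"key": "k1", "value": "x"},
    {"key": "k2", "value": "x"},
    {"key": "k3", "value": "y"},
]
matches = [(a["key"], b["key"]) for a in rows for b in rows if condition(a, b)]
print(matches)  # [('k1', 'k2'), ('k2', 'k1')]
```

Because each connective returns another predicate, terms can be nested arbitrarily, for example `or_(condition, not_(term("value", "<", "value")))`.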
[jira] Commented: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543362 ]
Edward Yoon commented on HADOOP-2021:
-------------------------------------
On second thought, I changed the result format.
I decided to store the join result as described below.
Thanks, jimk and stack.
If you find any errors or inconsistencies, please let me know.
{code}
r3
r1.row a b c r2.row e f
=============================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
{code}
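A minimal sketch of this result format, assuming the joined row key is simply the two source row keys joined with a dot and that the source keys are also kept as the columns `r1.row` and `r2.row`. The helper names are hypothetical and a plain dict stands in for the HBase table.

```python
def join_row_key(r1_row, r2_row, sep="."):
    """Build a unique row key for a joined row from the two source row keys."""
    return f"{r1_row}{sep}{r2_row}"


def make_result_row(r1_row, r1_cols, r2_row, r2_cols):
    """Return (row key, columns) for one joined row, keeping both source keys."""
    columns = {"r1.row": r1_row, "r2.row": r2_row}
    columns.update(r1_cols)
    columns.update(r2_cols)
    return join_row_key(r1_row, r2_row), columns


# A dict stands in for table r3; a duplicate key would overwrite the old row.
r3 = {}
r1_cols = {"a": "a1", "b": "b1", "c": "c1"}
for r2_row, r2_cols in [("row1", {"e": "e1", "f": "a1"}),
                        ("row4", {"e": "e4", "f": "a1"})]:
    key, columns = make_result_row("row1", r1_cols, r2_row, r2_cols)
    r3[key] = columns

# The source row sets can be read back from the stored columns.
r1_rows = {cols["r1.row"] for cols in r3.values()}
r2_rows = {cols["r2.row"] for cols in r3.values()}
print(sorted(r3), r1_rows, r2_rows)
```

Storing the source keys as columns means they do not have to be parsed back out of the composite key, which would be ambiguous if a source row key itself contained the separator.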
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Commented: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547395 ]
Edward Yoon commented on HADOOP-2021:
-------------------------------------
Hudson does not seem to be running.
Does anyone know why?
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
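The worked example above can be reproduced with a small in-memory sketch. It assumes tables are plain dicts keyed by row, that temp table T is a list sorted by the value of `f`, and that T is probed with a binary search. The function names are hypothetical, and this is not the MapReduce code from the attached patches.

```python
from bisect import bisect_left

r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def build_sorted_index(table, column):
    """Group row keys by the value of `column`, sorted by that value (table T)."""
    groups = {}
    for row_key, cols in table.items():
        groups.setdefault(cols[column], []).append(row_key)
    return sorted((value, sorted(rows)) for value, rows in groups.items())


def sort_join(left, left_col, right, right_col):
    """Join left and right on left_col = right_col via a sorted index on right."""
    index = build_sorted_index(right, right_col)
    values = [value for value, _ in index]
    result = {}
    for l_key in sorted(left):
        wanted = left[l_key][left_col]
        pos = bisect_left(values, wanted)  # binary search in the sorted set
        if pos < len(values) and values[pos] == wanted:
            for r_key in index[pos][1]:
                result[f"{l_key}.{r_key}"] = {
                    "r1.row": l_key, **left[l_key],
                    "r2.row": r_key, **right[r_key],
                }
    return result


T = build_sorted_index(r2, "f")
r3 = sort_join(r1, "a", r2, "f")
for key, cols in r3.items():
    print(key, cols)
```

Each row of r1 costs one binary search in T rather than a scan of all of r2.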
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Patch Available (was: Open)
Re-submitting after changing the Affects Version/s.
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.16.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Patch Available (was: Open)
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
udanax edited comment on HADOOP-2021 at 11/17/07 7:32 PM:
---------------------------------------------------------------
However, a row with a duplicate key cannot be inserted into an HBase table, so:
{code}
r3
a b c e f
==============================
row1.row1 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
In that case, can I still get the set of r1.row or r2.row values?
Hmm.
Also, is this correct in theory?
If you have any ideas, let me know.
was (Author: udanax):
However, a row with a duplicate key cannot be inserted into an HBase table, so:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
In that case, can I still get the set of r1.row or r2.row values?
Hmm.
Also, is this correct in theory?
If you have any ideas, let me know.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. Create a sorted-set temp file for the sort join using an MR job on HDFS.
2. Create a new relation table in HBase.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. Create a sorted-set temp file for the sort join using an MR job.
2. Create a new relation table.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. Create a sorted-set temp file for the sort join using an MR job on HDFS.
> 2. Create a new relation table in HBase.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
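To show why sorting can improve on a nested-loop join, here is a minimal single-process sketch of a sort-merge join. It is an assumption-laden illustration, not the MR job described in the steps: both inputs are in-memory lists, and the `title` and `ticker` fields and all sample values are hypothetical.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Equi-join two lists by sorting both and merging with two cursors."""
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right, key=right_key)
    out = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_key(left_sorted[i])
        rk = right_key(right_sorted[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right_sorted) and right_key(right_sorted[j_end]) == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole right run.
            while i < len(left_sorted) and left_key(left_sorted[i]) == lk:
                out.extend((left_sorted[i], r) for r in right_sorted[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical stand-ins for movieLog_table and stockCompany_info.
movie_log = [
    {"title": "m1", "studioName": "studioB"},
    {"title": "m2", "studioName": "studioA"},
    {"title": "m3", "studioName": "studioB"},
]
stock_company_info = [
    {"corporation": "studioA", "ticker": "AAA"},
    {"corporation": "studioB", "ticker": "BBB"},
    {"corporation": "studioC", "ticker": "CCC"},
]

result = sort_merge_join(
    movie_log,
    stock_company_info,
    lambda r: r["studioName"],
    lambda r: r["corporation"],
)
pairs = [(l["title"], r["ticker"]) for l, r in result]
print(pairs)
```

After the two sorts, each input is walked once apart from the matching runs, whereas a nested-loop join compares every row of one table with every row of the other.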
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Issue Type: Improvement (was: Sub-task)
Parent: (was: HADOOP-1608)
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Affects Version/s: (was: 0.14.1)
0.15.0
Status: Patch Available (was: Open)
re-submitting.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v04.patch
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Open (was: Patch Available)
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
was:If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
Updated the description with an example.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) θ Join Condition
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Component/s: contrib/hbase
Fix Version/s: 0.16.0
Priority: Minor (was: Major)
Description:
* stands for the usual comparison operators '<','<=', '>', '>=', '!=', '='
* comparing terms in the Q clauses can be arbitrarily connected with boolean operators AND, NOT, OR
{code}
ex. (a.value = b.value and a.key != b.key)
{code}
Environment: all environments
Affects Version/s: 0.14.1
> θ Join Condition
> ----------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
>
> * stands for the usual comparison operators '<','<=', '>', '>=', '!=', '='
> * comparing terms in the Q clauses can be arbitrarily connected with boolean operators AND, NOT, OR
> {code}
> ex. (a.value = b.value and a.key != b.key)
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Patch Available (was: Open)
re-submitting.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543333 ]
udanax edited comment on HADOOP-2021 at 11/17/07 7:24 PM:
---------------------------------------------------------------
However, a row with a duplicate key cannot be inserted into an HBase table, so:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
In that case, can I still get the set of r1.row or r2.row values?
Hmm.
Also, is this correct in theory?
If you have any ideas, let me know.
was (Author: udanax):
However, a row with a duplicate key cannot be inserted into an HBase table, so:
{code}
r3
a b c e f
==============================
row1.row4 a1 b1 c1 e1 a1
row1.row4 a1 b1 c1 e4 a1
{code}
Hmm.
Is this correct in theory?
If you have any ideas, let me know.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description: If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join. (was: * stands for the usual comparison operators '<','<=', '>', '>=', '!=', '='
* comparing terms in the Q clauses can be arbitrarily connected with boolean operators AND, NOT, OR
{code}
ex. (a.value = b.value and a.key != b.key)
{code})
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v01.patch
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: (was: patch.txt)
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
----
{code}
r1
        a    b    c
======================
row1    a1   b1   c1
row2    a2   b2   c2
row3    a1   b3   c3
r2
        e    f
==================
row1    e1   a1
row2    e2   f2
row3    e3   f3
row4    e4   a1
row5    e5   a2
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
---------------------------------------------
temp table T : Sorted set by "f"
        row
=================
a1      row:row1
        row:row4
a2      row:row5
f2      row:row2
f3      row:row3
---------------------
r3
            r1.row   a    b    c    r2.row   e    f
====================================================
row1.row1   row1     a1   b1   c1   row1     e1   a1
row1.row4   row1     a1   b1   c1   row4     e4   a1
row2.row5   row2     a2   b2   c2   row5     e5   a2
row3.row1   row3     a1   b3   c3   row1     e1   a1
row3.row4   row3     a1   b3   c3   row4     e4   a1
{code}
was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
----
{code}
r1
a b c
======================
row1 a1 b1 c1
row2 a2 b2 c2
row3 a1 b3 c3
r2
e f
==================
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
row5 e5 a2
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
---------------------------------------------
temp table T : Sorted set by "f"
row e f
===========================
a1 row:row1 e1 a1
row:row4 e4
a2 row:row5 e5 a2
f2 row:row2 e2 f2
f3 row:row3 e3 f3
---------------------------------------------
r3
r1.row a b c r2.row e f
===================================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
row2.row5 row2 a2 b2 c2 row5 e5 a2
row3.row1 row3 a1 b3 c3 row1 e1 a1
row3.row4 row3 a1 b3 c3 row4 e4 a1
{code}
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Affects Version/s: (was: 0.15.0)
0.15.1
Status: Open (was: Patch Available)
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: patch.txt
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, patch.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Patch Available (was: In Progress)
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) [Hbase Shell] Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Affects Version/s: (was: 0.15.1)
0.16.0
Changing the Affects Version/s.
> [Hbase Shell] Sort Join Implementation
> --------------------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Affects Versions: 0.16.0
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Issue Comment Edited: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543362 ]
udanax edited comment on HADOOP-2021 at 11/18/07 2:02 AM:
---------------------------------------------------------------
On second thought, I changed the result format.
I decided to store the join result as described below.
Thanks, jimk and stack.
If you find any errors or inconsistencies, let me know.
{code}
r3
r1.row a b c r2.row e f
=============================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
{code}
was (Author: udanax):
On second thought, I changed the result format.
And I had decision to store the join result as describe below.
Thanks, jimk and stack.
but, if you find a errors or inconsistencies, let me know.
{code}
r3
r1.row a b c r2.row e f
=============================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row2 e4 a1
{code}
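The result format in the comment above keys each joined row by the two source row keys joined with a dot (row1.row4). A minimal sketch of that keying, assuming the separator is always "." and that source row keys contain no dot themselves; the helper names are hypothetical and are not taken from the patch:

```python
def composite_row_key(r1_row, r2_row, sep="."):
    """Build the key of a joined row from the two source row keys."""
    return f"{r1_row}{sep}{r2_row}"


def split_row_key(key, sep="."):
    """Recover the source row keys; only safe if r1_row contains no separator."""
    r1_row, r2_row = key.split(sep, 1)
    return r1_row, r2_row


key = composite_row_key("row1", "row4")
print(key)
print(split_row_key(key))
```

If real row keys can contain the separator, the split is ambiguous, so a different separator or a length prefix would be needed.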
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Open (was: Patch Available)
Canceling the patch; it does not seem to have been registered.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row
> =============
> a1 row:row1
> row:row4
> a2 row:row5
> f2 row:row2
> f3 row:row3
> ---------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
----
{code}
r1
a b c
======================
row1 a1 b1 c1
row2 a2 b2 c2
row3 a1 b3 c3
r2
e f
==================
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
row5 e5 a2
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
---------------------------------------------
temp table T : Sorted set by "f"
row e f
===========================
a1 row:row1 e1 a1
row:row4 e4
a2 row:row5 e5 a2
f2 row:row2 e2 f2
f3 row:row3 e3 f3
---------------------------------------------
r3
r1.row a b c r2.row e f
===================================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
row2.row5 row2 a2 b2 c2 row5 e5 a2
row3.row1 row3 a1 b3 c3 row1 e1 a1
row3.row4 row3 a1 b3 c3 row4 e4 a1
{code}
was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. make a sorted set temp file for sort join using MR job
2. make a new Relation table on hbase
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
----
{code}
r1
a b c
======================
row1 a1 b1 c1
row2 a2 b2 c2
row3 a1 b3 c3
r2
e f
==================
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
row5 e5 a2
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
---------------------------------------------
temp table T : Sorted set by "f"
row e f
===========================
a1 row:row1 e1 a1
row:row4 e4
a2 row:row5 e5 a2
f2 row:row2 e2 f2
f3 row:row3 e3 f3
---------------------------------------------
r3
r1.row a b c r2.row e f
===================================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
row2.row5 row2 a2 b2 c2 row5 e5 a2
row3.row1 row3 a1 b3 c3 row1 e1 a1
row3.row4 row3 a1 b3 c3 row4 e4 a1
{code}
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Status: Open (was: Patch Available)
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v06.patch
- Added some comments.
This patch performs the join in a slightly different way, using two MapReduce passes.
I think we can discuss a parallel sort-merge join later.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch, 2021_v06.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Commented: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543331 ]
Edward Yoon commented on HADOOP-2021:
-------------------------------------
r1
a b c
================
row1 a1 b1 c1
row2 a2 b2 c2
r2
e f
============
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
{code}
r3
a b c row e f
=========================
row1 a1 b1 c1 row1 e1 a1
row1 a1 b1 c1 row4 e4 a1
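For comparison, here is a minimal sketch of the nested-loop join that the issue description wants to improve on, run over the r1 and r2 tables from this comment. It is plain Python over in-memory dicts, not HBase or MapReduce code, and the function name is hypothetical:

```python
# Example relations from the comment above: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
}


def nested_loop_join(left, left_col, right, right_col):
    """Compare every left row with every right row (|left| * |right| checks)."""
    joined = []
    comparisons = 0
    for left_key, left_cells in left.items():
        for right_key, right_cells in right.items():
            comparisons += 1
            if left_cells[left_col] == right_cells[right_col]:
                joined.append((left_key, right_key, {**left_cells, **right_cells}))
    return joined, comparisons


r3, comparisons = nested_loop_join(r1, "a", r2, "f")
for left_key, right_key, cells in r3:
    print(left_key, right_key, cells)
print("comparisons:", comparisons)
```

Every pair of rows is compared, so the cost grows with the product of the table sizes; that is the cost a sort join tries to avoid.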
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. make a sorted set temp file for sort join using MR job
2. make a new Relation table on hbase
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
----
{code}
r1
a b c
======================
row1 a1 b1 c1
row2 a2 b2 c2
row3 a1 b3 c3
r2
e f
==================
row1 e1 a1
row2 e2 f2
row3 e3 f3
row4 e4 a1
row5 e5 a2
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;
---------------------------------------------
temp table T : Sorted set by "f"
row e f
===========================
a1 row:row1 e1 a1
row:row4 e4
a2 row:row5 e5 a2
f2 row:row2 e2 f2
f3 row:row3 e3 f3
---------------------------------------------
r3
r1.row a b c r2.row e f
===================================================
row1.row1 row1 a1 b1 c1 row1 e1 a1
row1.row4 row1 a1 b1 c1 row4 e4 a1
row2.row5 row2 a2 b2 c2 row5 e5 a2
row3.row1 row3 a1 b3 c3 row1 e1 a1
row3.row4 row3 a1 b3 c3 row4 e4 a1
{code}
was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. make a sorted set temp file for sort join using MR job on hdfs
2. make a new Relation table on hbase
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
Updated the description.
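The two steps in the description (build a temp table T sorted by the join column, then build the new relation) can be sketched in plain Python as follows. This is only an illustration on the example data: in-memory dicts stand in for the HBase tables and for the MR job, and the function names are hypothetical, not those used in the patch.

```python
from collections import defaultdict

# Example relations from the description: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def build_sorted_index(relation, column):
    """Step 1: group row keys by the join column, ordered by that value (table T)."""
    groups = defaultdict(list)
    for row_key, cells in relation.items():
        groups[cells[column]].append(row_key)
    return {value: sorted(keys) for value, keys in sorted(groups.items())}


def sort_join(left, left_col, right, right_col):
    """Step 2: look up each left row in T and emit one joined row per match."""
    index = build_sorted_index(right, right_col)
    result = {}
    for left_key in sorted(left):
        for right_key in index.get(left[left_key][left_col], []):
            joined = {"r1.row": left_key, **left[left_key],
                      "r2.row": right_key, **right[right_key]}
            result[f"{left_key}.{right_key}"] = joined
    return result


temp_t = build_sorted_index(r2, "f")
r3 = sort_join(r1, "a", r2, "f")
for key, cells in r3.items():
    print(key, cells)
```

Each r1 row is matched through a single lookup in T instead of a scan of all r2 rows, which is where the saving over the nested-loop join comes from.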
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
1. make a sorted set temp file for sort join using MR job
2. make a new Relation table
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
Updated the description.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job
> 2. make a new Relation table
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v02.txt
I found a bug in the TableMap and TableReduce classes.
The row iterator produces many duplicated column qualifiers.
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> 1. make a sorted set temp file for sort join using MR job
> 2. make a new Relation table on hbase
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}
[jira] Updated: (HADOOP-2021) Sort Join Implementation
Posted by "Edward Yoon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------
Attachment: 2021_v05.patch
{code}
r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f and r1.a = 'a2') and r2;
save r3 into table('result');
{code}
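A rough sketch of what the query above could evaluate to on the r1 and r2 example tables from the issue description. It assumes that `and r1.a = 'a2'` acts as an extra selection on r1 and that the saved rows are keyed as r1.row.r2.row; the comment shows no output, so both are assumptions, and the function name is hypothetical.

```python
# Example relations from the issue description: row key -> {column: value}.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def join_where(left, left_col, right, right_col, literal):
    """Join on left_col == right_col, keeping only left rows equal to literal."""
    rows = {}
    for left_key in sorted(left):
        left_cells = left[left_key]
        if left_cells[left_col] != literal:
            continue  # selection applied before the join to shrink the input
        for right_key in sorted(right):
            right_cells = right[right_key]
            if right_cells[right_col] == left_cells[left_col]:
                rows[f"{left_key}.{right_key}"] = {**left_cells, **right_cells}
    return rows


result = join_where(r1, "a", r2, "f", "a2")
print(result)
```

Under these assumptions only r1's row2 passes the selection, so the saved table would hold the single row row2.row5.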
> Sort Join Implementation
> ------------------------
>
> Key: HADOOP-2021
> URL: https://issues.apache.org/jira/browse/HADOOP-2021
> Project: Hadoop
> Issue Type: Sub-task
> Components: contrib/hbase
> Affects Versions: 0.14.1
> Environment: all environments
> Reporter: Edward Yoon
> Assignee: Edward Yoon
> Priority: Minor
> Fix For: 0.16.0
>
> Attachments: 2021_v01.patch, 2021_v02.txt, 2021_v04.patch, 2021_v05, 2021_v05.patch
>
>
> If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
> a b c
> ======================
> row1 a1 b1 c1
> row2 a2 b2 c2
> row3 a1 b3 c3
> r2
> e f
> ==================
> row1 e1 a1
> row2 e2 f2
> row3 e3 f3
> row4 e4 a1
> row5 e5 a2
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
> ---------------------------------------------
> temp table T : Sorted set by "f"
> row e f
> ===========================
> a1 row:row1 e1 a1
> row:row4 e4
> a2 row:row5 e5 a2
> f2 row:row2 e2 f2
> f3 row:row3 e3 f3
> ---------------------------------------------
> r3
> r1.row a b c r2.row e f
> ===================================================
> row1.row1 row1 a1 b1 c1 row1 e1 a1
> row1.row4 row1 a1 b1 c1 row4 e4 a1
> row2.row5 row2 a2 b2 c2 row5 e5 a2
> row3.row1 row3 a1 b3 c3 row1 e1 a1
> row3.row4 row3 a1 b3 c3 row4 e4 a1
> {code}