Posted to common-issues@hadoop.apache.org by "jiang hehui (JIRA)" <ji...@apache.org> on 2016/06/21 09:01:09 UTC

[jira] [Updated] (HADOOP-13304) distributed database for store, mapreduce for compute

     [ https://issues.apache.org/jira/browse/HADOOP-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiang hehui updated HADOOP-13304:
---------------------------------
    Description: 
In Hadoop, HDFS is responsible for storage and MapReduce is responsible for computation.
My idea is to store the data in a distributed database instead, and to compute over it in a MapReduce-like way.

h2. How would it work?

* insert:
using two-phase commit: according to the split policy, route the row to its target nodes and execute the insert there

* delete:
using two-phase commit: according to the split policy, route to the target nodes and execute the delete there

* update:
using two-phase commit: if the split policy keeps the record on the same node, execute the update in place; if the record moves to a different node, first delete the old value on the source node, then insert the new value on the destination node (a sketch of this two-phase-commit write path follows the panel below)
* select:
** a simple select (data on a single node, no cross-node merging needed) runs just as it would on a standalone database server;
** a complex select (distinct, group by, order by, subquery, or join across multiple nodes) is what we call a job
{panel}
{color:red}a job is parsed into stages; stages have lineage, and all the stages of a job form a DAG (Directed Acyclic Graph). Every stage consists of a mapsql, a shuffle, and a reducesql.
when a SQL request is received, the execution plan, containing the DAG and the mapsql, shuffle, and reducesql of each stage, is generated from the metadata; the plan is then executed and the result is returned to the client (a minimal sketch of these structures follows this panel).

it is the same as in Spark: an RDD corresponds to a table, and a job to a job;
it is the same as MapReduce in Hadoop: mapsql corresponds to map, shuffle to shuffle, and reducesql to reduce.
{color}
{panel}
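Below is a minimal sketch, in Java, of the stage/DAG structures described in the panel above. Every class and field name here (Stage, ExecutionPlan, and so on) is hypothetical, chosen only to illustrate the idea:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one stage of a job: the SQL pushed to the map
// side, the column the intermediate rows are shuffled by, and the SQL
// run on the reduce side.
class Stage {
    final String mapSql;
    final String shuffleKey;
    final String reduceSql;
    final List<Stage> parents = new ArrayList<>(); // lineage: stages this one reads from

    Stage(String mapSql, String shuffleKey, String reduceSql) {
        this.mapSql = mapSql;
        this.shuffleKey = shuffleKey;
        this.reduceSql = reduceSql;
    }
}

// Hypothetical sketch of the execution plan: the DAG of all stages of a
// job, executed in dependency order (parents before children).
class ExecutionPlan {
    final List<Stage> stages = new ArrayList<>();

    Stage addStage(String mapSql, String shuffleKey, String reduceSql, Stage... parents) {
        Stage s = new Stage(mapSql, shuffleKey, reduceSql);
        for (Stage p : parents) {
            s.parents.add(p);
        }
        stages.add(s);
        return s;
    }
}
{code}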
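And for the write path (the insert/delete/update bullets above), a hedged sketch of the two-phase-commit routing. The StoreNode interface and its methods are assumptions for illustration, not an existing API:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for a store node taking part in two-phase commit.
interface StoreNode {
    boolean prepare(String sql);  // phase 1: execute tentatively and vote yes/no
    void commit();                // phase 2: make the prepared change durable
    void rollback();              // phase 2: undo the prepared change
}

class TwoPhaseCommitWriter {
    // Run one write (insert/delete/update) on every node the split policy
    // routed it to: all nodes must vote yes in phase 1, otherwise every
    // already-prepared node is rolled back.
    static boolean execute(String sql, List<StoreNode> targets) {
        List<StoreNode> prepared = new ArrayList<>();
        for (StoreNode node : targets) {
            if (node.prepare(sql)) {
                prepared.add(node);
            } else {
                for (StoreNode p : prepared) {
                    p.rollback();
                }
                return false;
            }
        }
        for (StoreNode p : prepared) {
            p.commit();
        }
        return true;
    }
}
{code}

A moved update (delete on the source node, insert on the destination node) would simply pass both statements through the same prepare/commit round.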


h2. Architecture
!http://images2015.cnblogs.com/blog/439702/201606/439702-20160621124133334-32823985.png!

* client: the user interface
* master: the coordinator, playing the role the NameNode plays in HDFS
* meta database: holds the base information about the system, the nodes, the tables, and so on (a sketch of this catalog follows the list)
* store node: a database node where data is stored; insert, delete, and update always execute on store nodes
* calculate node: where selects execute; the source data lives on the store nodes, and the remaining tasks run on the calculate nodes. In practice a calculate node may be the same machine as a store node.
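A hedged sketch of what the master might ask the meta database while planning; the MetaCatalog interface below is an assumption for illustration only:

{code:java}
import java.util.List;

// Hypothetical view of the meta database as seen by the master.
interface MetaCatalog {
    List<String> storeNodesFor(String table); // which store nodes hold the table
    String splitPolicyFor(String table);      // how the table's rows are split across nodes
    List<String> calculateNodes();            // nodes available to run select jobs
}

// The planning decision implied by the component list above: mapsql runs
// where the data lives, reducesql runs on the calculate nodes.
class Master {
    private final MetaCatalog catalog;

    Master(MetaCatalog catalog) {
        this.catalog = catalog;
    }

    List<String> mapNodes(String table) {
        return catalog.storeNodesFor(table);
    }

    List<String> reduceNodes() {
        return catalog.calculateNodes();
    }
}
{code}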


h2. Examples

{panel}
select age, count(u_id) v from tab_user_info t where u_reg_dt >= ? and u_reg_dt <= ? group by age


The execution plan might be:
stage0:
	mapsql:
		select age, count(u_id) v from tab_user_info t where u_reg_dt >= ? and u_reg_dt <= ? group by age

	shuffle:
		shuffle by age with a range policy:
		if the number of reduce nodes is N, each node covers a range of roughly (max(age)-min(age))/N;
		reduce nodes are numbered, and a node with a smaller id stores a smaller range of age, so the group by can run independently on each node

	reducesql:
		select age, sum(v) v from t group by age

note:
the group by must run again on the reduce nodes, because partial counts for the same age arrive from different mapsql outputs and need to be summed (a sketch of the range policy follows this panel)



{panel}
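A minimal sketch of the range shuffle policy used in the plan above, under the same assumptions: N reduce nodes, each covering an equal slice of the key range, with smaller node ids holding smaller values. The class and method names are hypothetical:

{code:java}
// Hypothetical range partitioner: the key space [min, max] is cut into N
// equal slices and slice i goes to reduce node i.
class RangePartitioner {
    static int nodeFor(long key, long min, long max, int n) {
        long width = (max - min + n) / n;  // ceiling of (max-min+1)/N, at least 1
        int node = (int) ((key - min) / width);
        return Math.min(node, n - 1);      // clamp rounding at the top edge
    }
}
{code}

Because every row with the same age lands on the same reduce node, the final group by can run locally on each node.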


{panel}

select t1.u_id, t1.u_name, t2.login_product
from tab_user_info t1 join tab_login_info t2
on (t1.u_id = t2.u_id and t1.u_reg_dt >= ? and t1.u_reg_dt <= ?)

The execution plan might be:

stage0:
	mapsql:
		select u_id, u_name from tab_user_info t where u_reg_dt >= ? and u_reg_dt <= ?;
		select u_id, login_product from tab_login_info t;

	shuffle:
		shuffle by u_id with a range policy:
		if the number of reduce nodes is N, each node covers a range of roughly (max(u_id)-min(u_id))/N;
		reduce nodes are numbered, and a node with a smaller id stores a smaller range of u_id, so the join can run independently on each node

	reducesql:
		select t1.u_id, t1.u_name, t2.login_product
		from tab_user_info t1 join tab_login_info t2
		on (t1.u_id = t2.u_id)

note:
because this is a join, each record must be tagged with its source table so that the reduce step can tell which table a record belongs to (a sketch of this tagging follows this panel)


{panel}
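A hedged sketch of the tagging mentioned in the note: each shuffled record carries the name of the table it came from, so the reduce step can split the rows for one u_id back into the two sides of the join. The record layout and names are assumptions:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical shuffled record: the join key, the source-table tag from
// the note above, and one remaining column (u_name or login_product).
class TaggedRecord {
    final long uId;
    final String table;
    final String value;

    TaggedRecord(long uId, String table, String value) {
        this.uId = uId;
        this.table = table;
        this.value = value;
    }
}

class ReduceJoin {
    // All records for one u_id arrive at the same reduce node; the tag
    // tells us which side of the join each record belongs to.
    static List<String> joinOneKey(long uId, List<TaggedRecord> group) {
        List<String> userNames = new ArrayList<>();
        List<String> loginProducts = new ArrayList<>();
        for (TaggedRecord r : group) {
            if (r.table.equals("tab_user_info")) {
                userNames.add(r.value);
            } else {
                loginProducts.add(r.value);
            }
        }
        List<String> out = new ArrayList<>();
        for (String name : userNames) {
            for (String product : loginProducts) {
                out.add(uId + "," + name + "," + product);
            }
        }
        return out;
    }
}
{code}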



> distributed database for store, mapreduce for compute
> ------------------------------------------------------
>
>                 Key: HADOOP-13304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13304
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.6.4
>            Reporter: jiang hehui


