Posted to common-issues@hadoop.apache.org by "jiang hehui (JIRA)" <ji...@apache.org> on 2016/06/21 09:01:09 UTC
[jira] [Updated] (HADOOP-13304) distributed database for store, mapreduce for compute
[ https://issues.apache.org/jira/browse/HADOOP-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jiang hehui updated HADOOP-13304:
---------------------------------
Description:
In Hadoop, HDFS is responsible for storage and MapReduce for computation.
My idea is to store data in a distributed database and to compute over it in a MapReduce-like way.
h2. How does it work?
* insert:
using two-phase commit, route by the split policy and execute the insert on the target nodes
* delete:
using two-phase commit, route by the split policy and execute the delete on the target nodes
* update:
using two-phase commit, route by the split policy: if the record's node does not change, execute the update in place; if it does change, first delete the old value on the source node, then insert the new value on the destination node
* select:
** a simple select (data on a single node, no fusion across nodes needed) works just like on a standalone database server
** a complex select (distinct, group by, order by, subquery, or join across multiple nodes) is what we call a job
{panel}
{color:red}A job is parsed into stages; stages have lineage, and all the stages in a job form a DAG (Directed Acyclic Graph). Every stage contains a mapsql, a shuffle, and a reducesql.
When a SQL request is received, the execution plan, which contains the DAG (the stages and the mapsql, shuffle, and reducesql of each stage), is generated from the metadata; the plan is then executed and the result returned to the client.
The analogy with Spark: an RDD is a table, a job is a job.
The analogy with MapReduce in Hadoop: mapsql is map, shuffle is shuffle, reducesql is reduce.
{color}
{panel}
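The write path above (two-phase commit plus a split policy) can be sketched as follows. This is a minimal illustration, not an existing Hadoop API; the class names, the hash split policy, and the in-memory "nodes" are all made up for the example.

```python
# Minimal sketch: route rows by a split policy, then two-phase commit.
class StoreNode:
    def __init__(self, name):
        self.name = name
        self.staged = []   # rows staged during the prepare phase
        self.rows = []     # committed rows

    def prepare(self, row):
        self.staged.append(row)
        return True        # vote "yes"; a real node could vote "no"

    def commit(self):
        self.rows.extend(self.staged)
        self.staged.clear()

    def abort(self):
        self.staged.clear()

def insert(nodes, rows, split_policy):
    """Route each row to its node per the split policy, then run 2PC."""
    batches = {}
    for row in rows:
        batches.setdefault(split_policy(row), []).append(row)
    involved = [nodes[i] for i in batches]
    # Phase 1: every involved node stages its rows and votes.
    votes = [all(nodes[i].prepare(r) for r in batch)
             for i, batch in batches.items()]
    if all(votes):
        for n in involved:   # Phase 2: commit everywhere.
            n.commit()
        return True
    for n in involved:       # Any "no" vote aborts everywhere.
        n.abort()
    return False

# Example: a hash split policy over two store nodes.
nodes = [StoreNode("n0"), StoreNode("n1")]
ok = insert(nodes, [{"u_id": 7}, {"u_id": 4}], lambda r: r["u_id"] % 2)
```

Delete and update follow the same shape: stage the change on every involved node in phase 1, then commit or abort on all of them in phase 2.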
h2. Architecture
!http://images2015.cnblogs.com/blog/439702/201606/439702-20160621124133334-32823985.png!
* client: the user interface
* master: coordinates the cluster, playing a role similar to the NameNode in HDFS
* meta database: holds the basic information about the system, the nodes, the tables, and so on
* store node: a database node where data is stored; insert, delete, and update always execute here
* calculate node: where selects execute; the source data sits on the store nodes, and the remaining tasks run on the calculate nodes. In practice a calculate node may be the same machine as a store node
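The plan structure described in the red panel above (a job is a DAG of stages, each stage carrying a mapsql, a shuffle, and a reducesql) can be sketched like this. All class and field names are illustrative assumptions, not an existing API:

```python
# Sketch of the proposed plan structure: a Job is a DAG of Stages,
# and each Stage carries a mapsql, a shuffle key, and a reducesql.
from dataclasses import dataclass, field

@dataclass
class Stage:
    map_sql: str
    shuffle_key: str
    reduce_sql: str
    parents: list = field(default_factory=list)  # lineage between stages

@dataclass
class Job:
    stages: list

    def topo_order(self):
        """Run parents before children -- the DAG constraint."""
        seen, order = set(), []
        def visit(s):
            if id(s) in seen:
                return
            seen.add(id(s))
            for p in s.parents:
                visit(p)
            order.append(s)
        for s in self.stages:
            visit(s)
        return order

# A one-stage job matching the group-by example later in this issue.
s0 = Stage(
    map_sql="select age,count(u_id) v from tab_user_info group by age",
    shuffle_key="age",
    reduce_sql="select age,sum(v) from t group by age",
)
job = Job(stages=[s0])
```

The master would build such a Job from the metadata when a complex select arrives, then hand each stage to the calculate nodes in topological order.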
h2. Examples
{panel}
select age,count(u_id) v from tab_user_info t where u_reg_dt>=? and u_reg_dt<=? group by age

The execution plan may be:
stage0:
mapsql:
select age,count(u_id) v from tab_user_info t where u_reg_dt>=? and u_reg_dt<=? group by age
shuffle:
Shuffle by age with a range policy.
For example, if the number of reduce nodes is N, each node covers an age range of about (max(age)-min(age))/N.
Reduce nodes are numbered, and a node with a small id stores the small age range, so each node can run its group by independently.
reducesql:
select age,sum(v) from t group by age
note:
The group by must run on the reduce nodes because partial counts for the same age arrive from different mapsql outputs and need to be aggregated.
{panel}
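The shuffle and reduce of this plan can be sketched as follows: map outputs (age, partial_count) pairs are routed so that each of N reduce nodes owns a contiguous age range, and each node then sums its partial counts on its own. The data and function names are made up for illustration.

```python
# Sketch of a range-policy shuffle followed by a per-node group by.
def range_partition(key, lo, hi, n):
    """Node id for a key under a range policy: node 0 gets the smallest keys."""
    width = (hi - lo) / n
    return min(int((key - lo) / width), n - 1)

def shuffle_and_reduce(map_output, n):
    lo = min(age for age, _ in map_output)
    hi = max(age for age, _ in map_output) + 1   # half-open upper bound
    buckets = [dict() for _ in range(n)]
    for age, v in map_output:                    # shuffle phase
        b = buckets[range_partition(age, lo, hi, n)]
        b[age] = b.get(age, 0) + v               # reduce: sum(v) group by age
    return buckets

# Partial counts for age 20 arriving from two different map nodes:
map_output = [(20, 3), (20, 2), (35, 1), (50, 4)]
buckets = shuffle_and_reduce(map_output, n=2)
```

Because the range policy sends every occurrence of an age to the same reduce node, the per-node group by is already the global answer.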
{panel}
select t1.u_id,t1.u_name,t2.login_product
from tab_user_info t1 join tab_login_info t2
on (t1.u_id=t2.u_id and t1.u_reg_dt>=? and t1.u_reg_dt<=?)

The execution plan may be:
stage0:
mapsql:
select u_id,u_name from tab_user_info t where u_reg_dt>=? and u_reg_dt<=? ;
select u_id,login_product from tab_login_info t ;
shuffle:
Shuffle by u_id with a range policy.
For example, if the number of reduce nodes is N, each node covers a u_id range of about (max(u_id)-min(u_id))/N.
Reduce nodes are numbered, and a node with a small id stores the small u_id range, so each node can run its join independently.
reducesql:
select t1.u_id,t1.u_name,t2.login_product
from tab_user_info t1 join tab_login_info t2
on (t1.u_id=t2.u_id)
note:
Because this is a join, each record must be tagged with its source table so the reduce node can tell which table a record belongs to.
{panel}
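The table tagging in the note above is the classic reduce-side join pattern; it can be sketched as follows. The table contents and helper names are illustrative only.

```python
# Sketch of a tagged reduce-side join: each map record carries a tag
# naming its source table, so the reduce node can join rows by u_id.
def map_phase(user_rows, login_rows):
    """Emit (u_id, (tag, payload)) pairs; the tag names the source table."""
    out = []
    for u_id, u_name in user_rows:
        out.append((u_id, ("user", u_name)))
    for u_id, product in login_rows:
        out.append((u_id, ("login", product)))
    return out

def reduce_join(pairs):
    """On each reduce node: group by u_id, then join users with logins."""
    by_id = {}
    for u_id, tagged in pairs:
        by_id.setdefault(u_id, []).append(tagged)
    result = []
    for u_id, tagged in by_id.items():
        users = [p for t, p in tagged if t == "user"]
        logins = [p for t, p in tagged if t == "login"]
        for u_name in users:          # inner join: users with no logins
            for product in logins:    # produce no output rows
                result.append((u_id, u_name, product))
    return result

pairs = map_phase(
    user_rows=[(1, "alice"), (2, "bob")],
    login_rows=[(1, "mail"), (1, "chat")],
)
rows = reduce_join(pairs)
```

Since the shuffle routes every record with the same u_id to the same reduce node, each node's local join is correct without seeing any other node's data.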
> distributed database for store, mapreduce for compute
> ------------------------------------------------------
>
> Key: HADOOP-13304
> URL: https://issues.apache.org/jira/browse/HADOOP-13304
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs
> Affects Versions: 2.6.4
> Reporter: jiang hehui
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)