You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@carbondata.apache.org by geda <be...@gmail.com> on 2016/12/14 15:10:13 UTC

carbondata test join question

hello
i  want to test orc ,carbon , which is faster
test on yarn with 8 executors, each executor 4G,each 2 core
spark1.6 ,carbon2.0
sql : join  3 table 
A :300W row,5GB
B:13W row,30MB
C:7W row,10MB
like this:
select b.id ,b.d_name ,a.v_no,count(*) o_count  from    a left join  b on
a.d_id=b.id left join    c on  c.v_no=a.v_no  where  date(a.create_time)>=
'2016-07-01' and  date(a.create_time)<= '2016-09-02' group by  b.id ,
b.d_name, a.v_no  having  o_count> 30 order by b.id desc 
use context 
cc.sql($SQL).show() : carbondata :run 5times  avg time :7.3s
hiveContext.sql($SQL).show() : ORC : run 5times  avg time:5.3s
i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
cause 2-3s spend
if strip this job ,carbon and orc  use time more or less the same
i want to know how to strip  the last stage or how to tune sql like this
.Thanks
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/jobtrace.png> 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-carbon.png> 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-orc.png> 






--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: carbondata test join question

Posted by Liang Chen <ch...@gmail.com>.

Hi geda

As we know, CarbonData's key feature is index.  
About tuning SQL, you can refer to  :
https://cwiki.apache.org/confluence/display/CARBONDATA/FAQ

Regards
Liang







--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440p4446.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: carbondata test join question

Posted by Jay <25...@qq.com>.

hi, geda
 can u share ur create ddl? 
 some suggestion: for that filter field (like id), u can try to put in left column and use dictionary_include or exclude to make it dimension. if  cardinality more than 100 thousand, u can try to make it no dictionary.
 as for ur question, if all the dimensions make no dictionary, there will be no decode part.


thanks
Jay





------------------ Original ------------------
From:  "geda";<be...@gmail.com>;
Date:  Wed, Dec 14, 2016 11:10 PM
To:  "dev"<de...@carbondata.incubator.apache.org>; 

Subject:  carbondata test join  question



hello
i  want to test orc ,carbon , which is faster
test on yarn with 8 executors, each executor 4G,each 2 core
spark1.6 ,carbon2.0
sql : join  3 table 
A :300W row,5GB
B:13W row,30MB
C:7W row,10MB
like this:
select b.id ,b.d_name ,a.v_no,count(*) o_count  from    a left join  b on
a.d_id=b.id left join    c on  c.v_no=a.v_no  where  date(a.create_time)>=
'2016-07-01' and  date(a.create_time)<= '2016-09-02' group by  b.id ,
b.d_name, a.v_no  having  o_count> 30 order by b.id desc 
use context 
cc.sql($SQL).show() : carbondata :run 5times  avg time :7.3s
hiveContext.sql($SQL).show() : ORC : run 5times  avg time:5.3s
i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
cause 2-3s spend
if strip this job ,carbon and orc  use time more or less the same
i want to know how to strip  the last stage or how to tune sql like this
.Thanks
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/jobtrace.png> 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-carbon.png> 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-orc.png> 






--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.