Posted to issues@carbondata.apache.org by "Nihal kumar ojha (Jira)" <ji...@apache.org> on 2021/08/03 08:43:00 UTC

[jira] [Commented] (CARBONDATA-4177) performance issue with Query

    [ https://issues.apache.org/jira/browse/CARBONDATA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392172#comment-17392172 ] 

Nihal kumar ojha commented on CARBONDATA-4177:
----------------------------------------------

Hi,

    Currently, CarbonData does not support limit pushdown for either the main table or an MV table; it is supported only for a secondary index (SI). Because of this, even when we select rows with LIMIT 10, Carbon first fills a vector of 4096 rows and sends it to Spark; Spark then applies the limit and returns the result. So even for 10 rows, we fetch 4096 rows, and that takes time. In the future we may support limit pushdown for the main table and MVs so that only 10 rows are fetched instead of 4096; your query would then benefit.
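To make the cost concrete, here is a small illustrative sketch (not CarbonData code; the batch size of 4096 is taken from the comment above, and the function names are hypothetical) showing why a fixed reader batch size makes a small LIMIT pay for a full batch:

```python
# Illustrative sketch: cost of a small LIMIT with and without limit pushdown.
# BATCH_SIZE reflects the 4096-row vector mentioned above; everything else
# is a simplified model, not CarbonData's actual reader.

BATCH_SIZE = 4096

def rows_read_without_pushdown(limit):
    # The reader always materializes whole batches; Spark applies LIMIT
    # afterwards, so at least one full batch is decoded even for a tiny limit.
    batches = max(-(-limit // BATCH_SIZE), 1)  # ceiling division, minimum one batch
    return batches * BATCH_SIZE

def rows_read_with_pushdown(limit):
    # Hypothetical behaviour if the limit were pushed into the reader:
    # stop after producing exactly `limit` rows.
    return limit

print(rows_read_without_pushdown(10))  # 4096 rows decoded for a 10-row result
print(rows_read_with_pushdown(10))     # 10
```

The gap (4096 vs 10 rows decoded per scan) is why the LIMIT 10 query below still spends time reading far more data than it returns.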

> performance issue with Query
> ----------------------------
>
>                 Key: CARBONDATA-4177
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4177
>             Project: CarbonData
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.0.1
>            Reporter: suyash yadav
>            Priority: Major
>             Fix For: 2.0.1
>
>
> Hi Team,
> We are working on a POC using CarbonData 2.0.1 and have come across a performance issue. Below are the details:
> 1.Table creation query:
> ======================
> spark.sql("create table Flow_TS_1day_stats_16042021(start_time timestamp,end_time timestamp,source_ip_address string,destintion_ip_address string,appname string,protocol_name string,source_tos smallint,in_interface smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask smallint,destination_mask smallint, dst_tos smallint,input_pkt bigint,input_byt bigint,output_pkt bigint,output_byt bigint,source_port int,destination_port int) stored as carbondata TBLPROPERTIES ('local_dictionary_enable'='false')").show()
> TWO MVs are there on this table, Below are the queries for those MVs
> :===========================================================
> 1. Network MV
> ================
> spark.sql("create materialized view Network_Level_Agg_10min_MV_with_ip_15042021_again as select timeseries(end_time,'ten_minute') as end_time,source_ip_address, destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_15042021_again group by timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)
> 2. Interface MV:
> ==================
> spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as select timeseries(end_time,'ten_minute') as end_time, source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 group by timeseries(end_time,'ten_minute'), source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)
> +*We are firing the below query for fetching data, which is taking almost 10 seconds:*+
> *Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= '2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by appname,input_byt order by input_byt desc LIMIT 10*
>  
> The above query is only fetching 10 records but it is taking almost 10 seconds to complete.
> Could you please review the above schemas and help us understand how we can improve the query execution time? We expect the response to be in sub-seconds.
> Table Name : RAW Table (1 Day - 300K/Sec)
> # Records : 25920000000
> Regards,
> Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)