You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@carbondata.apache.org by 马云 <si...@163.com> on 2017/09/23 16:26:54 UTC
[DISCUSSION] optimization of OrderBy sorted columns + Limit Query
Hi Carbon Dev,
We faced slow sqls when apply carbondata in business system.
In our custom info management system, only 2 million data in carbondata.
we need to fetch data sorting by the costum’s phone number field and paginate data on the web page.
For example, when loading our system index web page, it will execute the below similar sql
Select phone_number, other_field1, other_field2 , other_field3 , other_field4 from custominfo order by phone_number limit 10;
In our prod env, it takes about 4 seconds to execute this sql, it is slow for only 2 million data system.
In another car info management system, it has 0.1 billion data in carbondata.
need to fetch data by sorting the car id field and business date, and paginate data on the web page.
Similar sql as below
Select car_id, business_date, other_field1, other_field2, other_field3, other_field4 from carinfo order by car_info, business_date limit 10;
I done test in my local , it takes about 30 seconds to execute the sql.
In short, the more data, the worse performance even we just need the top 10 since it used full scan for orderby operation
Actually a few months ago, I have come up with the optimization plan for orderby+limit,
Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860 <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860>
mainly, pushdown order by +limit to carbon scan, leverage the sort column’s order stored feature to get topN sorted data in each block
and reduce io scan. From test result, it can improve about 80% performance.
In previous discussion, we have the below conclusion.
The optimization solution has below limitations
1. It cannot work for dictionary columns. As there is no guarantee that
dictionary allocation is in sorted order.
2. It only can work when order by one dimension + limit or order by prefix columns of MDK
3. It can’t optimize when orderby measures +limit
This time I think we can optimize the below cases, please give your suggestion
1. Order by prefix columns of sort columns if no dictionary encode + limit
2. Order by a dimension with no dictionary encode + limit
for example date, string type column
3. Order by a dimension with dictionary encode if creating table with a pre-defined dictionaries + limit
4. Order by a number field which is added to sort colums + limit
for example int, bitint type column
5. Order by dimension with dictionary encode + limit
In this case, there is no guarantee that
dictionary allocation is in sorted order.
So maybe we can calculate the dictionary’s order according to the original value’s order in memory
And get top N according to the original value’s order
For example in table t3 having a country field with dictionary encode,
Country dictionary
Australia 6
Canada 2
China 3
Uk 4
Usa 5
In Blocklet 1 dict value from 2 to 3 and have 32000 data
In Blocklet 2 dict value from 3 to 5 and have 32000 data
In Blocklet 3 dict value from 6 to 6 and have 32000 data
for the below Sql, apparently we only need to process blocklet3, it can reduce process time for blocklet1 and blocklet2
Select country , name from t3 order by country limit 100;
Thanks
Jarck
Re: [DISCUSSION] optimization of OrderBy sorted columns + Limit
Query
Posted by ravipesala <ra...@gmail.com>.
Hi,
As I understand as per your business scenario you require to sort on
dimension and apply a limit on it for pagination on your web application. It
is a very specific scenario for your case to handle. It is not even topN
queries as it is not applying on measures. If we push down this type of
specific scenarios to carbon we will have maintainability issues in
long-term.
It is recommended that these type of optimizations can be done in the
planning phase or integration layer and not to the level of carbon layer.
Regards,
Ravindra
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: [DISCUSSION] optimization of OrderBy sorted columns + Limit Query
Posted by 马云 <si...@163.com>.
Hi Liang,
yes, your right, I used the dictionary to do the limit and this solution can’t make sure the data correctness.
But, I also done order by test on date field which is the first column of MDK key, date field used the direct dictionary encode, it should can make sure the data correctness.
after test , it also can improve more than 80% performance
> 在 2017年9月29日,下午3:34,Liang Chen <ch...@gmail.com> 写道:
>
> Hi Jarck
>
> Did this solution use dictionary to do limit , right ? this solution can't
> make sure the data correctness
> -------------------------------------------------------------------------------
> Use orderby +limit optimized carbondata1.2 master code + spark1.6.3
>
> @Ravindra @Jarck : let us discuss it, see if can find out the common
> solution,
> Proposal 1 : Implement TopN solution.
> Proposal 2 : Decoupled plug-in solution, continue to use Jarck's code,
> just need change order by from dictionary value to actual dimensional value.
>
> Regards
> Liang
>
> simafengyun wrote
>> Recently , I used the latest code done test as below
>> 1. Create Table:
>> CREATE TABLE rx5_tbox_parquet_all(
>> carid STRING,
>> inputstime TIMESTAMP,
>> carsyspwrmod INT,
>> cardofrontpas INT,
>> cardofrontdrv INT,
>> cardorearleft INT,
>> cardorearright INT,
>> carbonnet INT,
>> carboot INT,
>> carwinfrontleft INT,
>> carwinrearleft INT,
>> carwinfrontright INT,
>> carwinrearright INT,
>> carsunroof INT,
>> carcsactive INT,
>> carcsenabled INT,
>> carseatbeltdrv INT
>> )
>> STORED BY 'carbondata'
>> TBLPROPERTIES('SORT_COLUMNS'='carid',
>> 'DICTIONARY_INCLUDE'='carid')
>>
>> 2. Load 0.1 billion data
>>
>> 3. Run the below sql
>>
>> select
>> carid,
>> inputstime,
>> carsyspwrmod,
>> cardofrontpas,
>> cardofrontdrv,
>> cardorearleft,
>> cardorearright,
>> carbonnet,
>> carboot,
>> carwinfrontleft,
>> carwinrearleft,
>> carwinfrontright,
>> carwinrearright
>> from rx5_tbox_parquet_all2
>> order by carid
>> limit 10
>>
>>
>> Use carbondata1.2 master code + spark2.1 to run
>> |carid
>> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
>> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>>
>> limit 10 query time: 28777 milliseconds
>>
>> Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run
>>
>> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>> |carid
>> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
>> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> |LSJA24790HS020662|null |2 |0 |0
>> |0
>> |0 |0 |0 |0 |0
>>
>> |0 |0 |
>> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>>
>> limit 10 query time: 1640 milliseconds
>>
>> Apparently, after optimization, even I use spark1.6.3, it also improved
>> 90%
>> performance
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Thanks
>> 马云
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>
>
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/>
Re: [DISCUSSION] optimization of OrderBy sorted columns + Limit
Query
Posted by Liang Chen <ch...@gmail.com>.
Hi Jarck
Did this solution use dictionary to do limit , right ? this solution can't
make sure the data correctness
-------------------------------------------------------------------------------
Use orderby +limit optimized carbondata1.2 master code + spark1.6.3
@Ravindra @Jarck : let us discuss it, see if can find out the common
solution,
Proposal 1 : Implement TopN solution.
Proposal 2 : Decoupled plug-in solution, continue to use Jarck's code,
just need change order by from dictionary value to actual dimensional value.
Regards
Liang
simafengyun wrote
> Recently , I used the latest code done test as below
> 1. Create Table:
> CREATE TABLE rx5_tbox_parquet_all(
> carid STRING,
> inputstime TIMESTAMP,
> carsyspwrmod INT,
> cardofrontpas INT,
> cardofrontdrv INT,
> cardorearleft INT,
> cardorearright INT,
> carbonnet INT,
> carboot INT,
> carwinfrontleft INT,
> carwinrearleft INT,
> carwinfrontright INT,
> carwinrearright INT,
> carsunroof INT,
> carcsactive INT,
> carcsenabled INT,
> carseatbeltdrv INT
> )
> STORED BY 'carbondata'
> TBLPROPERTIES('SORT_COLUMNS'='carid',
> 'DICTIONARY_INCLUDE'='carid')
>
> 2. Load 0.1 billion data
>
> 3. Run the below sql
>
> select
> carid,
> inputstime,
> carsyspwrmod,
> cardofrontpas,
> cardofrontdrv,
> cardorearleft,
> cardorearright,
> carbonnet,
> carboot,
> carwinfrontleft,
> carwinrearleft,
> carwinfrontright,
> carwinrearright
> from rx5_tbox_parquet_all2
> order by carid
> limit 10
>
>
> Use carbondata1.2 master code + spark2.1 to run
> |carid
> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>
> limit 10 query time: 28777 milliseconds
>
> Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run
>
> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
> |carid
> |inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> |LSJA24790HS020662|null |2 |0 |0
> |0
> |0 |0 |0 |0 |0
>
> |0 |0 |
> +-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
>
> limit 10 query time: 1640 milliseconds
>
> Apparently, after optimization, even I use spark1.6.3, it also improved
> 90%
> performance
>
>
>
>
>
>
>
>
>
> Thanks
> 马云
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: [DISCUSSION] optimization of OrderBy sorted columns + Limit
Query
Posted by simafengyun <si...@163.com>.
Recently , I used the latest code done test as below
1. Create Table:
CREATE TABLE rx5_tbox_parquet_all(
carid STRING,
inputstime TIMESTAMP,
carsyspwrmod INT,
cardofrontpas INT,
cardofrontdrv INT,
cardorearleft INT,
cardorearright INT,
carbonnet INT,
carboot INT,
carwinfrontleft INT,
carwinrearleft INT,
carwinfrontright INT,
carwinrearright INT,
carsunroof INT,
carcsactive INT,
carcsenabled INT,
carseatbeltdrv INT
)
STORED BY 'carbondata'
TBLPROPERTIES('SORT_COLUMNS'='carid',
'DICTIONARY_INCLUDE'='carid')
2. Load 0.1 billion data
3. Run the below sql
select
carid,
inputstime,
carsyspwrmod,
cardofrontpas,
cardofrontdrv,
cardorearleft,
cardorearright,
carbonnet,
carboot,
carwinfrontleft,
carwinrearleft,
carwinfrontright,
carwinrearright
from rx5_tbox_parquet_all2
order by carid
limit 10
Use carbondata1.2 master code + spark2.1 to run
|carid
|inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
+-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
+-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
limit 10 query time: *28777*
Use orderby +limit optimized carbondata1.2 master code + spark1.6.3 to run
+-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
|carid
|inputstime|carsyspwrmod|cardofrontpas|cardofrontdrv|cardorearleft|cardorearright|carbonnet|carboot|carwinfrontleft|carwinrearleft|carwinfrontright|carwinrearright|
+-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
|LSJA24790HS020662|null |2 |0 |0 |0
|0 |0 |0 |0 |0
|0 |0 |
+-----------------+--------+------------+---------------+---------------+---------------+----------------+---------+-------+------------------+-----------------+-------------------+------------------+
limit 10 query time: *1640*
Apparently, after optimization, even I use spark1.6.3, it also improved 90%
performance
Thanks
马云
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/