Posted to dev@carbondata.apache.org by xm_zzc <44...@qq.com> on 2018/09/19 16:02:29 UTC

Low Performance of full scan.

Hi dev:

Recently I compared full-scan performance between Parquet and CarbonData and
found that CarbonData's full scan was slower than Parquet's.

*My test:*
  1. Spark 2.2 + Parquet versus Spark 2.2 + CarbonData (master branch).
  2. Run in local[1] mode.
  3. There are 8 Parquet files in one folder, 47474456 records in total; each
     file is about *170* MB.
  4. There are 8 segments in one CarbonData table, 47474456 records in total;
     each segment has one file of about *220* MB, and each CarbonData file
     contains *4 blocklets and 186 pages*.
  5. The data in each Parquet file and CarbonData file is the same.
  6. create table sql:
  7. test sql:
     1). select count(chan),count(fcip),sum(size) from table;
     2). select chan,fcip,sum(size) from table group by chan, fcip order by
         chan, fcip;

*Test result:*
  SQL1:    Parquet:          4s       4s       4s
           CarbonData:      12s      11s      12s
  SQL2:    Parquet:         11s      10s      11s
           CarbonData:      18s      18s      19s

*Analyse:*
I added some timing counters to the code, changed the size of
CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode.
The timing stats (from one test run):
  1. BlockletFullScanner.readBlocklet: 169ms;
  2. BlockletFullScanner.scanBlocklet: 176ms;
  3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch:
     7958ms. In this part it takes about 200-300ms to handle each blocklet,
     so about 1s in total to handle one CarbonData file, but the carbon stat
     log shows about 1-2s per file for SQL1 and 2-3s per file for SQL2;
  4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each
     iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2;
  5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.

*Questions:*
  1. Is there any optimization for
     DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
  2. My timing stats show about 1s to handle one CarbonData file, but the
     Spark UI shows about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle?
     Compute?
  3. Could the size of CarbonVectorProxy be made configurable to reduce the
     number of iterations? The default value is 4 * 1024, and with it the
     iterator executes 11616 times.

BTW, once the optimization mentioned in this mailing thread
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
is done, I will use this case to test again.

Any feedback is welcome.
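
The iteration counts and per-iteration times quoted above can be cross-checked with a little arithmetic. The following is only a rough sketch in Python, not CarbonData code: the function names are made up for illustration, and the lower bound assumes that every batch could be filled completely, which the observed counts (11616 and 1464) suggest is not quite the case.

```python
# Rough sanity check of the numbers reported in the post (not CarbonData code).
TOTAL_ROWS = 47_474_456  # total records, from the test description


def min_batches(total_rows: int, batch_rows: int) -> int:
    """Lower bound on iterator calls if every batch were completely full."""
    return -(-total_rows // batch_rows)  # integer ceiling division


def scan_time_range_ms(iterations: int, per_iter_ms: tuple) -> tuple:
    """Total time range implied by a per-iteration time range, in ms."""
    low, high = per_iter_ms
    return iterations * low, iterations * high


if __name__ == "__main__":
    for batch_rows in (4 * 1024, 32 * 1024):
        print(batch_rows, "rows per batch ->", min_batches(TOTAL_ROWS, batch_rows), "batches at least")
    # 1464 iterations were measured with the 32 * 1024 batch size.
    print("SQL1 total (ms):", scan_time_range_ms(1464, (8, 9)))
    print("SQL2 total (ms):", scan_time_range_ms(1464, (10, 15)))
```

The lower bounds (11591 and 1449 batches) sit slightly below the observed 11616 and 1464 iterations, and 1464 iterations at 8-9ms and 10-15ms give roughly 11.7-13.2s and 14.6-22.0s, which is in line with the measured 11-12s for SQL1 and 18-19s for SQL2.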



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Low Performance of full scan.

Posted by xm_zzc <44...@qq.com>.

Hi chuanyin:
  I used SQL1 and SQL2 as test cases and ran them in local[4] mode.
  When the rowNum of CarbonVectorProxy (actually the capacity of
ColumnarBatch) is 4 * 1024 (default):
  SQL1: 8s, 9s (two runs),   SQL2: 12s, 11s
  When it is 16 * 1024:
  SQL1: 6s, 6s,              SQL2: 9s, 8s

  So changing this property benefits both of my test cases.
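
To put the local[4] runs above into numbers, here is a small Python sketch that averages the two runs per setting and computes the speedup. The timings are the ones quoted in this message; the dictionary layout and the function name are just illustrative choices.

```python
from statistics import mean

# Run times in seconds from the two local[4] runs, keyed by batch capacity.
runs_s = {
    "SQL1": {4 * 1024: [8, 9], 16 * 1024: [6, 6]},
    "SQL2": {4 * 1024: [12, 11], 16 * 1024: [9, 8]},
}


def speedup(query: str) -> float:
    """Mean time at the default capacity divided by mean time at 16 * 1024."""
    timings = runs_s[query]
    return mean(timings[4 * 1024]) / mean(timings[16 * 1024])


if __name__ == "__main__":
    for query in runs_s:
        print(query, "speedup:", round(speedup(query), 2))
```

With only two runs per setting this is a rough indication (about 1.4x for both queries), not a statistically solid measurement.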




Re: Low Performance of full scan.

Posted by xuchuanyin <xu...@hust.edu.cn>.

If this property is configurable, how do you want to use it?

Does changing this property benefit all of your queries? If it doesn't, a single system property may not suit every query. In that case, how about a per-query hint for this property?

> On Sep 20, 2018, at 00:02, xm_zzc <44...@qq.com> wrote:
> 
> 3. Can it support to configurate the size of CarbonVectorProxy to
> reduce times of iterate? Default value is 4 * 1024 and iterate executes
> 11616 times.
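
The trade-off raised in this reply (one system-wide property versus a per-query hint) could be sketched as a simple precedence rule: a hint, if present, overrides the system property, which in turn overrides the built-in default. The Python below is purely illustrative; `resolve_batch_rows` and its parameters are hypothetical names, not an existing CarbonData or Spark API.

```python
from typing import Optional

DEFAULT_BATCH_ROWS = 4 * 1024  # default capacity mentioned in the thread


def resolve_batch_rows(query_hint: Optional[int] = None,
                       system_property: Optional[int] = None) -> int:
    """Pick the batch capacity: query hint > system property > default.

    Both parameters are hypothetical; they stand in for however a hint or
    a property would actually be supplied.
    """
    for candidate in (query_hint, system_property):
        if candidate is not None:
            if candidate <= 0:
                raise ValueError("batch capacity must be positive")
            return candidate
    return DEFAULT_BATCH_ROWS


if __name__ == "__main__":
    print(resolve_batch_rows())                             # falls back to the default
    print(resolve_batch_rows(system_property=16 * 1024))    # system-wide setting
    print(resolve_batch_rows(query_hint=32 * 1024,
                             system_property=16 * 1024))    # hint wins for this query
```

A rule like this would let a system property cover the common case while still allowing individual queries that do not benefit from a larger batch to opt out.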

Re: Low Performance of full scan.

Posted by xm_zzc <44...@qq.com>.

Hi Ravindra:
    I re-ran the test cases mentioned above with Spark 2.3.2 + CarbonData
master branch, and the query performance of CarbonData is now almost the
same as Parquet's:

*Test result:*
  SQL1:    Parquet:      4.6s     4s       3.8s
           CarbonData:   4.7s     3.6s     3.5s
  SQL2:    Parquet:      9s       8s       8s
           CarbonData:   9s       8s       8s

  The query performance of CarbonData has improved a lot (SQL1: 12s to 4s,
SQL2: 18s to 8s), and the query performance of Parquet has also improved
(SQL2: 10s to 8s). That's great.
  But in the test results you mentioned in
'http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html',
the query performance of CarbonData was mostly better than Parquet's. How
did you test those cases? And are there other optimizations that have not
been merged yet?

Regards, 
Zhichao.




Re: Low Performance of full scan.

Posted by ravipesala <ra...@gmail.com>.

Hi,

Thanks for testing the performance. We have also observed this performance
difference and are working to improve it. Please check my latest discussion
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/CarbonData-Performance-Optimization-td62950.html)
on improving scan performance; I have raised a PR (still WIP) for it.
There is also one more discussion
(http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html)
on optimizing the store and improving performance.

Regards,
Ravindra.


