Posted to user@kudu.apache.org by 冯宝利 <fe...@uce.cn> on 2020/12/04 04:21:09 UTC

spark3.0 read kudu data

Hi,
    We are upgrading Spark from 2.4 to 3.0 and, during performance testing, found that Spark 3.0 reads Kudu data much more slowly than Spark 2.4. Reading the same amount of data normally takes 0.1-1 s with Spark 2.4, but 1 to 2 minutes with Spark 3.0. Both versions use the same spark-submit parameters and run in local mode, and the Kudu cluster, tables, and query conditions are identical.
    The only difference is the kudu-spark package: Spark 2.4 uses kudu-spark2_2.11 (Scala 2.11), while Spark 3.0 uses kudu-spark3_2.12 (Scala 2.12). The latter package was compiled from the Java source of Kudu 1.13, using a pom.xml file with Spark 3.0.0 and Scala 2.12.
    Our cluster runs CDH 6.3.1 with Kudu 1.10. Given this setup, what can we optimize, or what would you suggest, to improve the performance of reading Kudu data?
    Thanks!