Posted to user@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2018/07/05 19:14:59 UTC

Re: spark on kudu performance!

On Mon, Jun 11, 2018 at 5:52 AM, fengbaoli@uce.cn <fe...@uce.cn> wrote:

> Hi:
>
>  I am following the development documentation on the Kudu website
> to analyze Kudu data with Spark (Kudu version 1.6.0):
>
> The official code is:
>
>     import org.apache.kudu.spark.kudu._
>
>     val df = sqlContext.read
>       .options(Map("kudu.master" -> "kudu.master:7051",
>                    "kudu.table" -> "kudu_table"))
>       .kudu
>
>     // Query using the Spark API...
>     df.select("id").filter("id >= 5").show()
>
>
> My question  is :
> (1) If I use the official website code: when I create the DataFrame
> df, my table holds about 1.8 billion rows, and then the filter is
> applied to df. Isn't that equivalent to loading all 1.8 billion rows
> into memory on every query? The performance would be very poor.
>

That's not correct. DataFrames are lazily evaluated, so when you apply a
filter like the one above, Spark does not materialize the whole data frame
in memory before it begins to filter.
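
As a minimal sketch (reusing the placeholder df from your example),
nothing below touches Kudu until the final action:

    // Transformations only build a logical plan; no I/O happens here.
    val filtered = df.select("id").filter("id >= 5")

    // Only an action such as show() or count() triggers the scan, and
    // the pushed-down predicate limits what Kudu returns to Spark.
    filtered.show()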

You can also use ".explain()" to see whether the filter you are specifying
is getting pushed down properly to Kudu.
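
For instance (the exact plan text varies by Spark version, so treat the
quoted output below as an assumption):

    filtered.explain()
    // In the physical plan, the predicate should appear on the Kudu
    // relation/scan node (e.g. "PushedFilters: [GreaterThanOrEqual(id,5)]")
    // rather than only as a separate Spark Filter step over a full scan.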


>
> (2) If I create a time-based range partition on the 1.8-billion-row
> table and then scan specific partitions directly with the underlying
> Java API, wouldn't each query load only the data in the specified
> partitions rather than all 1.8 billion rows?
>
> Please give me some suggestions, thanks!
>
>
The above should happen automatically as long as the filter predicate has
been pushed down. Sharing the output of 'explain()', along with the code
you used to create your table, will help us understand what might be
causing the performance problem.
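
For reference, here is a hedged sketch of what a time-range-partitioned
table definition might look like with the Java client. The column names,
master address, and date bounds are hypothetical placeholders, not taken
from your setup:

    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import scala.collection.JavaConverters._

    // Hypothetical schema: both columns are keys, since range partition
    // columns must be part of the primary key.
    val schema = new Schema(List(
      new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64)
        .key(true).build(),
      new ColumnSchema.ColumnSchemaBuilder("event_time", Type.UNIXTIME_MICROS)
        .key(true).build()
    ).asJava)

    val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()

    val options = new CreateTableOptions()
      .setRangePartitionColumns(List("event_time").asJava)

    // One range partition per day; bounds are microseconds since epoch.
    val lower = schema.newPartialRow()
    lower.addLong("event_time", 1530748800000000L) // 2018-07-05 00:00 UTC
    val upper = schema.newPartialRow()
    upper.addLong("event_time", 1530835200000000L) // 2018-07-06 00:00 UTC
    options.addRangePartition(lower, upper)

    client.createTable("kudu_table", schema, options)

With a table like that, a pushed-down Spark predicate on event_time lets
the Kudu client skip entire range partitions, which is the automatic
pruning described above -- no need to drop to the Java scanner API for it.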

-Todd
--
Todd Lipcon
Software Engineer, Cloudera