Posted to user@spark.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/09/10 02:13:27 UTC

Spark + AccumuloInputFormat

Hi,

I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.

My Spark SQL app is running in yarn-client mode. The query I'm running is
"select count(*) from audit_log" (or a similarly simple query) where my
audit_log table has 14.3M rows, 504M key value pairs spread fairly evenly
across 8 tablet servers. Looking at the Accumulo monitor app, I only ever
see a maximum of 2 tablet servers with active scans. Since the data is
spread across all the tablet servers, I hoped to see 8!

I realize there are a lot of moving parts here, but I'd appreciate any advice
about where to start looking.

Using Spark 1.0.1 with Accumulo 1.6.

Thanks!
-Russ

Re: Spark + AccumuloInputFormat

Posted by Russ Weeks <rw...@newbrightidea.com>.

To answer my own question... I didn't realize that I was responsible for
telling Spark how much parallelism I wanted for my job. I had assumed that
Spark and YARN would work it out between themselves.

Adding --executor-memory 3G --num-executors 24 to my spark-submit command
took the query time down from 18 minutes to 30s, and I'm seeing much better
utilization of my Accumulo tablet servers.
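
For anyone who finds this thread later, here is a rough sketch of how those
flags fit into a full spark-submit invocation. It only builds and prints the
command line; it does not run Spark. The jar name and main class are
hypothetical placeholders, not the real application. The yarn-client master
comes from the original post, and the two resource flags are the ones quoted
above. In Spark 1.x on YARN, --num-executors defaults to 2 when it is not
set, which would be consistent with only two tablet servers showing active
scans, though that link is an inference and not something confirmed here.

```python
import shlex


def build_spark_submit(app_jar, main_class, num_executors=24,
                       executor_memory="3G", master="yarn-client"):
    """Return a spark-submit argument list with explicit executor settings.

    app_jar and main_class are placeholders supplied by the caller.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return [
        "spark-submit",
        "--master", master,                      # yarn-client, as in the original post
        "--class", main_class,                   # hypothetical entry point
        "--num-executors", str(num_executors),   # executors requested from YARN
        "--executor-memory", executor_memory,    # memory per executor
        app_jar,                                 # hypothetical application jar
    ]


# Hypothetical jar and class names, for illustration only.
cmd = build_spark_submit("audit-log-query.jar", "com.example.AuditLogQuery")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The values 24 and 3G suited this particular cluster; a sensible setting
elsewhere depends on how many cores and how much memory YARN can hand out.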

-Russ
