Posted to user@spark.apache.org by Heji Kim <hs...@gmail.com> on 2017/02/03 00:50:27 UTC

persistence iops and throughput check? Re: Running a spark code on multiple machines using google cloud platform

Dear Anahita,

When we run performance tests for Spark/YARN clusters on GCP, we have to
make sure we stay within the IOPS and throughput limits. Depending on the
disk type (standard or SSD) and the size of the disk, you only get a certain
maximum sustained IOPS and throughput per second. The GCP instance metrics
graphs are not great, but they are enough to determine whether you are over
the limit.

https://cloud.google.com/compute/docs/disks/performance
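
The scaling described in those docs can be sketched roughly as follows.
This is a hypothetical illustration, not an official calculator: the per-GB
rates and per-disk caps below are assumed placeholder values, and the real
numbers depend on disk type, machine type, and current GCP documentation.

```python
# Illustrative sketch: estimate sustained per-disk limits for GCP
# persistent disks from disk size. Persistent disk performance scales
# with provisioned size up to a per-disk cap. All numbers here are
# assumptions for illustration -- check the linked docs for real values.

PER_GB = {
    # disk_type: (read_iops_per_gb, write_iops_per_gb, mb_per_sec_per_gb)
    "pd-standard": (0.75, 1.5, 0.12),
    "pd-ssd": (30.0, 30.0, 0.48),
}

CAPS = {
    # disk_type: (max_read_iops, max_write_iops, max_mb_per_sec)
    "pd-standard": (7500, 15000, 240),
    "pd-ssd": (25000, 25000, 800),
}

def disk_limits(disk_type, size_gb):
    """Return (read_iops, write_iops, throughput_mbps) estimates,
    scaling linearly with size and clamped at the per-disk caps."""
    r, w, t = PER_GB[disk_type]
    cap_r, cap_w, cap_t = CAPS[disk_type]
    return (min(size_gb * r, cap_r),
            min(size_gb * w, cap_w),
            min(size_gb * t, cap_t))

if __name__ == "__main__":
    read_iops, write_iops, mbps = disk_limits("pd-standard", 500)
    print(f"500 GB pd-standard: ~{read_iops:.0f} read IOPS, "
          f"~{write_iops:.0f} write IOPS, ~{mbps:.0f} MB/s sustained")
```

Comparing these estimates against the observed disk metrics for the Spark
workers is usually enough to tell whether a job is throttled by the disks
rather than by CPU or memory.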

Heji

On Thu, Feb 2, 2017 at 4:29 AM, Anahita Talebi <an...@gmail.com>
wrote:

> Dear all,
>
> I am trying to run Spark code on multiple machines by submitting a job on
> Google Cloud Platform.
> As input, my code takes a training dataset and a testing dataset.
>
> With a small training dataset (around 10 KB), the code runs
> successfully on Google Cloud, but with a large dataset of around
> 50 GB, I receive the following error:
>
> 17/02/01 19:08:06 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(2,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@3101f3b3,null)
>
> Can anyone give me a hint on how to solve this problem?
>
> PS: I cannot use a smaller training dataset because my optimization code needs all the data.
>
> I have to use Google Cloud Platform because I need to run the code on multiple machines.
>
> Thanks a lot,
>
> Anahita
>
>