Posted to users@zeppelin.apache.org by "Jung, Soonoh" <so...@gmail.com> on 2016/10/04 00:37:48 UTC

Restart zeppelin spark interpreter

Hi everyone,

I am using Zeppelin on AWS EMR (Zeppelin 0.6.1, Spark 2.0 on YARN).
The Zeppelin Spark interpreter's Spark application does not finish after a
notebook executes, and it looks like it is still occupying a lot of memory
in my YARN cluster.
Is there a way to restart the Spark interpreter automatically (or
programmatically) every time I run a notebook, in order to release that
memory back to the YARN cluster?
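
For example, I was thinking of something like the following, using
Zeppelin's interpreter REST API (which I believe has a restart endpoint);
the port 8890 (the EMR default, as far as I know) and the setting ID are
placeholders:

    # Find the Spark interpreter's setting ID (inspect the JSON response)
    curl -s http://localhost:8890/api/interpreter/setting

    # Restart that interpreter setting, which should also release its
    # YARN resources; <SETTING_ID> is a placeholder
    curl -X PUT http://localhost:8890/api/interpreter/setting/restart/<SETTING_ID>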

Regards,
Soonoh

Re: Restart zeppelin spark interpreter

Posted by "Jung, Soonoh" <so...@gmail.com>.
Hi Jonathan,

I reinstalled EMR without maximizeResourceAllocation, but I still have the
same problem.

I also changed the default Spark interpreter to the "isolated" mode.
After I executed some notes and they had all finished, Zeppelin still uses
a lot of memory.
The Spark history UI says all jobs completed, but on the executors tab of
one app (application_1475616447166_0014) there are lots of active
executors, while the other 3 Spark apps each have only one active executor.
I wonder why those executors are not removed and why the YARN cluster
memory is not released.

[Inline images 1-3: screenshots omitted]
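
In case it helps, this is roughly how I checked the usage from the master
node with the plain YARN CLI (the app ID below is the one from the
screenshots):

    # List the YARN applications that are still RUNNING
    yarn application -list -appStates RUNNING

    # Show details for one app, including its aggregate resource allocation
    yarn application -status application_1475616447166_0014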


Here is how I created AWS-EMR instance:

aws emr create-cluster \
     --termination-protected \
     --applications Name=Ganglia Name=Spark Name=Zeppelin Name=Hive \
     --service-role EMR_DefaultRole \
     --enable-debugging \
     --release-label emr-5.0.0 \
     --name "${EMR_NAME}" \
     --instance-groups '[
         {"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master Instance Group"},
         {"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"r3.xlarge","Name":"Core Instance Group"},
         {"InstanceCount":6,"BidPrice":"0.15","InstanceGroupType":"TASK","InstanceType":"r3.xlarge","Name":"Task instance group - 6"}]' \
     --configurations '[
         {"Classification":"hadoop-env",
          "Properties":{},
          "Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},
         {"Classification":"spark-env",
          "Properties":{"maximizeResourceAllocation":"false"},
          "Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},
         {"Classification":"zeppelin-env",
          "Properties":{},
          "Configurations":[
              {"Classification":"export","Properties":{
                  "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
                  "ZEPPELIN_NOTEBOOK_S3_BUCKET":"zeppelin-notebook",
                  "ZEPPELIN_NOTEBOOK_S3_USER":"zeppelin-user"}}]}]'

and there is no other manual configuration on the Zeppelin, Spark, or YARN
side.
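
For reference, my understanding of Jonathan's suggestion is that, if I go
back to maximizeResourceAllocation, I would add one more entry to the
--configurations JSON above, something like this (untested):

    {"Classification":"spark-defaults",
     "Properties":{"spark.dynamicAllocation.enabled":"true"}}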

Regards,
Soonoh


On 4 October 2016 at 12:01, Jung, Soonoh <so...@gmail.com> wrote:

> Hi Jonathan,
>
> Thank you for the information!
> Yes, I am using maximizeResourceAllocation. I will try turning this off
> and just using dynamicAllocation.
>
> Regards,
> Soonoh
>
> On 4 October 2016 at 11:07, Jonathan Kelly <jo...@gmail.com> wrote:
>
>> On the most recent several releases of EMR, Spark dynamicAllocation is
>> automatically enabled, as it allows longer running apps like Zeppelin's
>> Spark interpreter to continue running in the background without taking up
>> resources for any executors unless Spark jobs are actively running.
>>
>> However, if you are seeing resources still being used even after some
>> idle time, maybe you are using maximizeResourceAllocation (which makes any
>> Spark job use 100% of the resources, with one executor per slave node). If
>> you use maximizeResourceAllocation, it effectively disables
>> dynamicAllocation because it causes spark.executor.instances to be set. If
>> you still want to use dynamicAllocation along with
>> maximizeResourceAllocation, just set spark.dynamicAllocation.enabled to
>> true in the spark-defaults configuration classification. This will signal
>> to the maximizeResourceAllocation feature not to set
>> spark.executor.instances so that dynamicAllocation will be used.
>>
>> Keep in mind that this might not be the ideal way to use
>> dynamicAllocation though (especially if you don't have many nodes in the
>> cluster) because the maximizeResourceAllocation feature would make the
>> executors very coarsely grained since there's only one per node. It would
>> still allow multiple applications to run at once though because executors
>> from one application could spin down when idle, allowing another
>> application to spin up executors.
>>
>> Hope this helps,
>> Jonathan
>>
>> On Mon, Oct 3, 2016 at 5:38 PM Jung, Soonoh <so...@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I am using Zeppelin on AWS EMR (Zeppelin 0.6.1, Spark 2.0 on YARN).
>>> The Zeppelin Spark interpreter's Spark application does not finish
>>> after a notebook executes, and it looks like it is still occupying a
>>> lot of memory in my YARN cluster.
>>> Is there a way to restart the Spark interpreter automatically (or
>>> programmatically) every time I run a notebook, in order to release
>>> that memory back to the YARN cluster?
>>>
>>> Regards,
>>> Soonoh
>>>
>>
>

Re: Restart zeppelin spark interpreter

Posted by "Jung, Soonoh" <so...@gmail.com>.
Hi Jonathan,

Thank you for the information!
Yes, I am using maximizeResourceAllocation. I will try turning this off
and just using dynamicAllocation.

Regards,
Soonoh

On 4 October 2016 at 11:07, Jonathan Kelly <jo...@gmail.com> wrote:

> On the most recent several releases of EMR, Spark dynamicAllocation is
> automatically enabled, as it allows longer running apps like Zeppelin's
> Spark interpreter to continue running in the background without taking up
> resources for any executors unless Spark jobs are actively running.
>
> However, if you are seeing resources still being used even after some idle
> time, maybe you are using maximizeResourceAllocation (which makes any Spark
> job use 100% of the resources, with one executor per slave node). If you
> use maximizeResourceAllocation, it effectively disables dynamicAllocation
> because it causes spark.executor.instances to be set. If you still want to
> use dynamicAllocation along with maximizeResourceAllocation, just set
> spark.dynamicAllocation.enabled to true in the spark-defaults
> configuration classification. This will signal to the
> maximizeResourceAllocation feature not to set spark.executor.instances so
> that dynamicAllocation will be used.
>
> Keep in mind that this might not be the ideal way to use
> dynamicAllocation though (especially if you don't have many nodes in the
> cluster) because the maximizeResourceAllocation feature would make the
> executors very coarsely grained since there's only one per node. It would
> still allow multiple applications to run at once though because executors
> from one application could spin down when idle, allowing another
> application to spin up executors.
>
> Hope this helps,
> Jonathan
>
> On Mon, Oct 3, 2016 at 5:38 PM Jung, Soonoh <so...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am using Zeppelin on AWS EMR (Zeppelin 0.6.1, Spark 2.0 on YARN).
>> The Zeppelin Spark interpreter's Spark application does not finish
>> after a notebook executes, and it looks like it is still occupying a
>> lot of memory in my YARN cluster.
>> Is there a way to restart the Spark interpreter automatically (or
>> programmatically) every time I run a notebook, in order to release
>> that memory back to the YARN cluster?
>>
>> Regards,
>> Soonoh
>>
>

Re: Restart zeppelin spark interpreter

Posted by Jonathan Kelly <jo...@gmail.com>.
On the most recent several releases of EMR, Spark dynamicAllocation is
automatically enabled, as it allows longer running apps like Zeppelin's
Spark interpreter to continue running in the background without taking up
resources for any executors unless Spark jobs are actively running.

However, if you are seeing resources still being used even after some idle
time, maybe you are using maximizeResourceAllocation (which makes any Spark
job use 100% of the resources, with one executor per slave node). If you
use maximizeResourceAllocation, it effectively disables dynamicAllocation
because it causes spark.executor.instances to be set. If you still want to
use dynamicAllocation along with maximizeResourceAllocation, just set
spark.dynamicAllocation.enabled to true in the spark-defaults configuration
classification. This will signal to the maximizeResourceAllocation feature
not to set spark.executor.instances so that dynamicAllocation will be used.

Keep in mind that this might not be the ideal way to use
dynamicAllocation though (especially if you don't have many nodes in the
cluster) because the maximizeResourceAllocation feature would make the
executors very coarsely grained since there's only one per node. It would
still allow multiple applications to run at once though because executors
from one application could spin down when idle, allowing another
application to spin up executors.
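
If you want to tune how quickly idle executors are released, the relevant
setting (per the Spark docs; 60s is the default) is
spark.dynamicAllocation.executorIdleTimeout, e.g. in the spark-defaults
classification:

    {"Classification":"spark-defaults",
     "Properties":{"spark.dynamicAllocation.executorIdleTimeout":"60s"}}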

Hope this helps,
Jonathan
On Mon, Oct 3, 2016 at 5:38 PM Jung, Soonoh <so...@gmail.com> wrote:

> Hi everyone,
>
> I am using Zeppelin on AWS EMR (Zeppelin 0.6.1, Spark 2.0 on YARN).
> The Zeppelin Spark interpreter's Spark application does not finish
> after a notebook executes, and it looks like it is still occupying a
> lot of memory in my YARN cluster.
> Is there a way to restart the Spark interpreter automatically (or
> programmatically) every time I run a notebook, in order to release
> that memory back to the YARN cluster?
>
> Regards,
> Soonoh
>