Posted to users@zeppelin.apache.org by Shanmukha Sreenivas Potti <sh...@utexas.edu> on 2017/05/03 22:54:21 UTC

Zeppelin best practices/ efficiencies

Hello Zeppelin users,



I’m reaching out to you for some guidance on best practices. We currently
use Zeppelin 0.7.0 on EMR, and I have a few questions about making this
setup more efficient that I would like to get addressed. I would really
appreciate it if any of you could help me with these issues or point me
to the right person or team.



*1. Interpreter Settings*



I understand that the newer versions (we are currently on Zeppelin 0.7)
offer different interpreter binding modes, namely Shared, Scoped, and
Isolated.

Multiple users on our team use the Zeppelin application by creating
separate notebooks. Sometimes jobs run endlessly, fail to execute, or time
out because memory is maxed out. We tend to restart the interpreter, and
are sometimes forced to restart the Zeppelin application on the EMR master
node to resume operations. Is this the best way to deal with such issues?
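For reference, this is roughly how we restart things today (a sketch: the
upstart service name is what EMR 5.x uses, and the REST endpoint is from the
Zeppelin 0.7 docs, so please verify both against your setup):

```shell
# Restart the whole Zeppelin daemon on the EMR master (EMR 5.x, upstart):
sudo stop zeppelin
sudo start zeppelin

# Or restart just one interpreter via Zeppelin's REST API, which keeps the
# other interpreters' sessions alive; <settingId> can be looked up with
# GET /api/interpreter/setting:
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/<settingId>
```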

We currently use the ‘Scoped’ interpreter setting, i.e. one interpreter
instance per note.

Would you recommend that we continue with this interpreter setting, or do
you think we would be better served by one of the other available modes? I
did take a look at the Zeppelin documentation on these settings, but any
additional guidance would be greatly appreciated.



Also, is there a way to accurately determine how much of the available
memory is being used by the various jobs on Zeppelin? The ‘Job’ tab shows
us which jobs in the various notebooks are running, but gives no insight
into the memory or compute power being used.
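For what it’s worth, the closest we have come to measuring this is through
YARN and the OS directly (a rough sketch; application ids and process names
will differ on your cluster):

```shell
# List running YARN applications and their ids (run on the EMR master):
yarn application -list

# Show the aggregate memory/vcore allocation for one application:
yarn application -status <application-id>

# Memory of the Zeppelin server process itself (the Spark driver lives here
# in yarn-client mode); RSS is reported in kilobytes:
ps -C java -o pid,rss,cmd | grep -i zeppelin
```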



Ideally, I would like to figure out the root cause of why my queries are
not running: is it because memory is maxing out on Zeppelin, HDFS, or
Spark, or because we have an insufficient number of compute nodes?



I would really appreciate any documentation you could share to guide me on
these aspects.



*2. Installation Ports*

By default, Zeppelin on EMR gets installed on port 8890. However, to be
compliant with our security policies we needed to use a different port. We
made this change by editing the Zeppelin configuration file over SSH. I’m
concerned that this approach may have cloned the application on the other
port and may be restricting my usage of Zeppelin. Is this the right way to
run Zeppelin on another port?
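Concretely, the change we made was equivalent to the following (the config
path is where EMR 5.x puts it, `ZEPPELIN_PORT` and `zeppelin.server.port`
are the documented settings, and the port number is just an example):

```shell
# Set the port via the env file (EMR config dir assumed):
echo 'export ZEPPELIN_PORT=8890' | sudo tee -a /etc/zeppelin/conf/zeppelin-env.sh

# ...or edit the zeppelin.server.port property in zeppelin-site.xml instead.
# Either way, restart the daemon so the change takes effect:
sudo stop zeppelin && sudo start zeppelin
```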



Appreciate any pointers you may have. Please see below for more information
on the cluster and the applications on the cluster.



*Thanks,*

*Shan*



*Cluster Details:*

Release label: emr-5.4.0

Applications: Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, HBase 1.3.0,
Zeppelin 0.7.0, Oozie 4.3.0, Mahout 0.12.2

Re: Zeppelin best practices/ efficiencies

Posted by Shanmukha Sreenivas Potti <sh...@utexas.edu>.
Thanks, Jeff!

I'll look into this solution.



-- 
Shan S. Potti,
737-333-1952
https://www.linkedin.com/in/shanmukhasreenivas

Re: Zeppelin best practices/ efficiencies

Posted by Jeff Zhang <zj...@gmail.com>.
Regarding the interpreter memory issue: this is because Zeppelin's Spark
interpreter only supports yarn-client mode, which means the driver runs on
the same host as the Zeppelin server. So it is pretty easy to run out of
memory if many users share the same driver (the scoped mode you use). You
can try the Livy interpreter, which supports yarn-cluster mode, so that the
driver runs on a remote host and each user gets an isolated Spark app.
https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/interpreter/livy.html
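To make that concrete, a minimal setup sketch (the property names are from
the Livy interpreter docs; the Livy host and memory sizes below are
placeholders, not recommendations):

```shell
# Livy listens on port 8998 by default; verify it is reachable from the
# Zeppelin host before pointing the interpreter at it:
curl http://<livy-host>:8998/sessions

# Then, under Interpreter > livy in the Zeppelin UI, set e.g.:
#   zeppelin.livy.url          = http://<livy-host>:8998
#   livy.spark.driver.memory   = 2g
#   livy.spark.executor.memory = 4g
```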

