Posted to users@zeppelin.apache.org by Pietro Pugni <pi...@gmail.com> on 2017/05/06 10:47:00 UTC

spark.r interpreter becomes unresponsive after some time and R process quits silently

Hi all,
I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
The configuration is:
- Ubuntu Server 16.04.2 LTS
- Spark 2.1.0
- Microsoft Open R 3.3.2
- Zeppelin 0.7.1 (0.7.0 gave the same problems)

zeppelin-env.sh has the following settings:
export SPARK_HOME="/spark/home/directory"

spark-env.sh has the following settings:
export LANG="en_US"
export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"

spark-defaults.conf is set as:
spark.executor.memory 21g
spark.driver.memory 21g
spark.python.worker.memory 4g
spark.sql.autoBroadcastJoinThreshold 0

I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin but this is what happens:
1) Start zeppelin on the server using the command service zeppelin start
2) Connect to port 8080 using Mozilla Firefox from client
3) Insert username and password (I enabled Shiro authentication)
4) Open a notebook
5) Execute the following code:
%spark.r
2+2
6) The code runs correctly and I can see that R is currently running as a process.
7) Repeat steps 2-5 after some time (say, 2 or 3 hours) and Zeppelin stays on “Running” forever or, if more time has elapsed since the last run (for example, 1 day), it returns “Error”. The time it takes to become unresponsive seems random and unpredictable. Also, R is no longer in the list of running processes. The Spark session remains active: I can access the Spark UI on port 4040 and the application name is “Zeppelin”, so it is the Spark instance created by Zeppelin.

I observed that sometimes I can simply restart the interpreter from the Zeppelin UI, but many other times that doesn’t work and I have to restart Zeppelin (service zeppelin restart).

This issue affects both 0.7.0 and 0.7.1; I haven’t tried earlier versions. It also happens if Zeppelin isn’t installed as a service.

I can’t provide more detail because I can’t see any error or warning in the logs, which is really strange.

Thank you all.
Kind regards
Pietro Pugni

Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Posted by Pietro Pugni <pi...@gmail.com>.

I opened a JIRA with all the details (logs etc):
https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZEPPELIN-2515

Thank you
 Pietro Pugni


Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Posted by Jongyoul Lee <jo...@gmail.com>.

Hi, thanks for the detailed debugging.

First, the NotebookServer log doesn't give any clue about this symptom because
NotebookServer only sits between the browser and the Zeppelin server.

I don't know why R stopped unexpectedly. Is there any log related to R?
I'm not actually familiar with R.

BTW, I'll install R and test it on my local machine.


-- 
이종열, Jongyoul Lee, 李宗烈
http://madeng.net

Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Posted by Pietro Pugni <pi...@gmail.com>.

I am reposting this because it didn’t appear on the mailing list board.

These are the steps needed to reproduce the error and to track down the log message.

1) I started a brand new instance of Zeppelin by issuing:
service zeppelin start

and started a bash script that tracks R process activity.
After I ran a simple R script from Zeppelin, the R interpreter process was started:

Mon May  8 11:27:59 CEST 2017 >>> R started

2) I left the browser open and closed it at 12:26:15. Zeppelin logged the connection being closed:
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null

3) At 13:08:00 the R process stopped. My script printed:
Mon May  8 13:08:00 CEST 2017 >>> R stopped
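
The tracking script itself is not included in this thread. A minimal sketch of such a watcher, assuming the interpreter shows up under the exact process name R and that pgrep is available (the function names probe, report_change and watch_r are made up for illustration), could look like this:

```shell
# Hypothetical watcher, not the script that was used in this thread.

# probe NAME: print "up" if a process with exactly that name exists, else "down".
probe() {
  if pgrep -x "$1" >/dev/null 2>&1; then echo up; else echo down; fi
}

# report_change PREV CUR: print a timestamped line only when the state changes.
report_change() {
  if [ "$1" != "$2" ]; then
    if [ "$2" = up ]; then
      echo "$(date) >>> R started"
    else
      echo "$(date) >>> R stopped"
    fi
  fi
}

# watch_r [NAME] [INTERVAL]: poll forever; NAME defaults to R, INTERVAL to 1 second.
# A start or stop is noticed with up to INTERVAL seconds of delay.
watch_r() {
  wr_name=${1:-R}
  wr_interval=${2:-1}
  wr_prev=down
  while true; do
    wr_cur=$(probe "$wr_name")
    report_change "$wr_prev" "$wr_cur"
    wr_prev=$wr_cur
    sleep "$wr_interval"
  done
}
```

Running something like watch_r R 1 >> r-watch.log in the background would produce lines in the same shape as the two shown above; the log file name is arbitrary.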

This is the output from the interpreter log file (irrelevant lines removed):
INFO [2017-05-08 11:27:43,632] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 45227
INFO [2017-05-08 11:27:44,600] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
INFO [2017-05-08 11:27:44,624] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2017-05-08 11:27:44,629] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
INFO [2017-05-08 11:27:44,640] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
INFO [2017-05-08 11:27:44,643] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkRInterpreter
...
INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179
DEBUG [2017-05-08 11:28:00,819] ({pool-1-thread-3} RemoteInterpreterServer.java[resourcePoolGetAll]:911) - Request getAll from ZeppelinServer
DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in handleErrors(returnStatus, conn) : 
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:  No status is returned. Java SparkR backend might have failed.
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Calls: <Anonymous> -> invokeJava -> handleErrors
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Execution halted
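
With DEBUG logging on, whatever R prints ends up in the interpreter log prefixed with "Interpreter output:". A small sketch for pulling just those lines, with their timestamps, out of a log read from standard input (the function name r_output is made up, and the path in the usage note is a placeholder):

```shell
# r_output: read a Zeppelin interpreter log on stdin and print
# "<timestamp> <text>" for every line that carries R's own output.
# All other log lines are dropped.
r_output() {
  sed -n 's/^[^[]*\[\([^]]*\)\].*Interpreter output:\(.*\)$/\1 \2/p'
}
```

It would be called as r_output < /path/to/interpreter.log. Fed the last DEBUG line above, it prints "2017-05-08 13:08:00,188 Execution halted".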

This is the output from the Zeppelin log file (it did not record the R interpreter failure):
INFO [2017-05-08 11:28:00,221] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2056) - Job 20170506-145151_1585482989 is finished successfully, status: FINISHED
INFO [2017-05-08 11:28:00,675] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job paragraph_1494075111996_-1250116940 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session2130846287
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:271) - Validating all active sessions...
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:304) - Finished session validation.  No sessions were stopped.
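
To put the timestamps above side by side, here is a small arithmetic sketch (the helper names to_seconds and elapsed are made up, and it assumes both times fall on the same day and use two-digit HH:MM:SS fields):

```shell
# to_seconds HH:MM:SS: print the number of seconds since midnight.
to_seconds() {
  ts_h=${1%%:*}
  ts_rest=${1#*:}
  ts_m=${ts_rest%%:*}
  ts_s=${ts_rest#*:}
  # Strip one leading zero so that "08" or "09" is not read as invalid octal.
  echo $(( ${ts_h#0} * 3600 + ${ts_m#0} * 60 + ${ts_s#0} ))
}

# elapsed START END: print END minus START in seconds.
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 11:27:59 13:08:00   # R started -> R stopped: 6001
elapsed 12:26:15 13:08:00   # browser closed -> R stopped: 2505
```

So in this run R exited about 1 hour 40 minutes after it started and about 42 minutes after the browser connection was closed. A single run does not show whether either interval is significant.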

Hope this helps. 
Any hints?

>> On 08 May 2017, at 11:08, Pietro Pugni <pietro.pugni@gmail.com> wrote:
>> 
>> I know for sure that the R process gets killed (or quits), but I don't know whether its parent process (interpreter.sh) gets killed too.
>> 
>> I noticed that I can always restart the interpreter on 0.7.1, while on 0.7.0 it was sometimes impossible (I had to restart the zeppelin service manually). That JIRA probably improved the situation a little.
>> 
>> Now I'm running a bash script that tracks the start and stop times of the R process in order to shed some light on this issue. I also enabled DEBUG logging in the log4j properties file.
>> 
>> 
>> On 6 May 2017 at 4:43 PM, "Paul Brenner" <pbrenner@placeiq.com> wrote:
>> 
>> Great work documenting repeatable steps for this hard-to-nail-down problem. I see similar problems running the spark (scala) interpreter but haven’t been as systematic about hunting down the issue as you.
>> 
>> I do wonder if this is related somehow to https://issues.apache.org/jira/browse/ZEPPELIN-1832, which seems to have addressed only the killing of zombie processes; I’m not sure it covered where the zombie processes are coming from. Perhaps we need to open a ticket for this?
>> 
>> In the meantime, if you can’t restart Zeppelin every time you run into this problem, you can probably just kill the interpreter process. I find myself doing that multiple times in a normal work day.
>> 
>> Paul Brenner
>> DATA SCIENTIST
>> (217) 390-3033
>> 
>> On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pi...@gmail.com> wrote:
>> Hi all,
>> I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
>> The configuration is:
>>  - Ubuntu Server 16.04.2 LTS
>>  - Spark 2.1.0
>>  - Microsoft Open R 3.3.2
>>  - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>> 
>> zeppelin-env.sh has the following settings:
>> export SPARK_HOME="/spark/home/directory"
>> 
>> spark-env.sh has the following settings:
>> export LANG="en_US"
>> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
>> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>> 
>> spark-defaults.conf is set as:
>> spark.executor.memory           	21g
>> spark.driver.memory                     21g
>> spark.python.worker.memory      	4g
>> spark.sql.autoBroadcastJoinThreshold    0
>> 
>> I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin but this is what happens:
>> 1) Start zeppelin on the server using the command service zeppelin start
>> 2) Connect to port 8080 using Mozilla Firefox from client 
>> 3) Insert username and password (I enabled Shiro authentication)
>> 4) open a notebook
>> 5) Execute the following code:
>> %spark.r
>> 2+2
>> 6) The code runs correctly and I can see that R is currently running as a process.
>> 7) Repeat steps 2-5 after some time (let’s say 2 or 3 hours) and Zeppelin remains forever on “Running” or, if the elapsed time is higher (for example 1 day) since the last run, it returns “Error”. The “time-to-be-unresponsive” seems to be random and unpredictable. Also, R is not present in the list of running processes. Spark session remains active because I can access Spark UI from port 4040 and the application name is “Zeppelin”, so it’s the Spark instance created by Zeppelin.
>> 
>> I observed that sometimes I can simply restart the interpreter from Zeppelin UI, but many other times it doesn’t work and I have to restart Zeppelin ( service zeppelin restart ).
>> 
>> This issue afflicts both 0.7.0 and 0.7.1 but I haven’t tried with previous versions. It also happens if Zeppelin isn’t installed as a service.
>> 
>> I can’t provide more detail because I can’t see any error or warning in the logs. This is really strange.
>> 
>> Thank you all.
>> Kind regards
>>  Pietro Pugni
>> 
>

Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Posted by Pietro Pugni <pi...@gmail.com>.

I know for sure that the R process gets killed (or quits), but I don't know
whether its parent process (interpreter.sh) gets killed too.

I noticed that I can always restart the interpreter on 0.7.1, whereas on
0.7.0 it was sometimes impossible (I had to restart the Zeppelin service
manually). That JIRA probably improved the situation a little.

I'm now running a bash script that tracks the start and stop times of the
R process in order to shed some light on this issue. I have also enabled
DEBUG logging in the log4j properties file.
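
A watcher of the kind described above could be sketched roughly as follows. This is not the script actually used in this thread: the process pattern, the log file path and the polling interval are illustrative assumptions, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical watcher: records when a process matching a pattern appears
# or disappears. All defaults below are assumptions, not values from the thread.

# Print "up" if any process command line matches the pattern, else "down".
# Note: pgrep -f matches full command lines, so do not pass the pattern as a
# command-line argument of this script, or the script may match itself.
process_state() {
  if pgrep -f "$1" >/dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}

# Print a timestamped line only when the state differs from the previous one.
log_transition() {
  local prev="$1" curr="$2"
  if [ "$prev" != "$curr" ]; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') R process: ${prev} -> ${curr}"
  fi
}

# Poll forever and append every transition to the log file.
watch_process() {
  local pattern="${1:-bin/exec/R}"        # assumed fragment of the R command line
  local logfile="${2:-/tmp/r_watch.log}"  # assumed log location
  local interval="${3:-10}"               # seconds between checks
  local prev="unknown" curr
  while true; do
    curr="$(process_state "$pattern")"
    log_transition "$prev" "$curr" >> "$logfile"
    prev="$curr"
    sleep "$interval"
  done
}
```

Calling `watch_process` at the end of the script (for example under `nohup`) starts the polling loop. Comparing the timestamps in the log file with the Zeppelin DEBUG logs may help narrow down when the R process disappears.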


On 6 May 2017 4:43 PM, "Paul Brenner" <pb...@placeiq.com> wrote:

> Great work documenting repeatable steps for this hard to nail down
> problem. I see similar problems running the spark (scala) interpreter but
> haven’t been as systematic about hunting down the issue as you.
>
> I do wonder if this is related somehow to
> https://issues.apache.org/jira/browse/ZEPPELIN-1832
> which just seems to have addressed killing off zombie processes but I’m
> not sure it covered where zombie processes are coming from. Perhaps we need
> to open a ticket for this?
>
> In the meantime, if you don’t have the ability to restart Zeppelin every
> time you run into this problem, you can probably just kill the interpreter
> process. I find myself doing that multiple times in a normal work day.
>
> Paul Brenner
> DATA SCIENTIST
> (217) 390-3033
>
> On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pi...@gmail.com> wrote:
>
>> Hi all,
>> I am facing a strange issue on two different machines that act as
>> servers. Each of them runs an instance of Zeppelin installed as a systemd
>> service.
>> The configuration is:
>>  - Ubuntu Server 16.04.2 LTS
>>  - Spark 2.1.0
>>  - Microsoft Open R 3.3.2
>>  - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>>
>> zeppelin-env.sh has the following settings:
>> export SPARK_HOME="/spark/home/directory"
>>
>> spark-env.sh has the following settings:
>> export LANG="en_US"
>> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir
>> -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
>> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>>
>> spark-defaults.conf is set as:
>> spark.executor.memory           21g
>> spark.driver.memory                     21g
>> spark.python.worker.memory       4g
>> spark.sql.autoBroadcastJoinThreshold    0
>>
>> I use Spark in stand-alone mode and it works perfectly. It also works
>> correctly with Zeppelin but this is what happens:
>> 1) Start zeppelin on the server using the command service zeppelin start
>> 2) Connect to port 8080 using Mozilla Firefox from client
>> 3) Insert username and password (I enabled Shiro authentication)
>> 4) open a notebook
>> 5) Execute the following code:
>> %spark.r
>> 2+2
>> 6) The code runs correctly and I can see that R is currently running as a
>> process.
>> 7) Repeat steps 2-5 after some time (let’s say 2 or 3 hours) and Zeppelin
>> remains forever on “Running” or, if the elapsed time is higher (for example
>> 1 day) since the last run, it returns “Error”. The
>> “time-to-be-unresponsive” seems to be random and unpredictable. Also, R is
>> not present in the list of running processes. Spark session remains active
>> because I can access Spark UI from port 4040 and the application name is
>> “Zeppelin”, so it’s the Spark instance created by Zeppelin.
>>
>> I observed that sometimes I can simply restart the interpreter from
>> Zeppelin UI, but many other times it doesn’t work and I have to restart
>> Zeppelin ( service zeppelin restart ).
>>
>> This issue afflicts both 0.7.0 and 0.7.1 but I haven’t tried with
>> previous versions. It also happens if Zeppelin isn’t installed as a service.
>>
>> I can’t provide more detail because I can’t see any error or warning in
>> the logs. This is really strange.
>>
>> Thank you all.
>> Kind regards
>>  Pietro Pugni
>>
>
>

spark.r interpreter becomes unresponsive after some time and R process quits silently

Posted by Paul Brenner <pb...@placeiq.com>.

Great work documenting repeatable steps for this hard-to-nail-down problem. I see similar problems running the spark (scala) interpreter but haven’t been as systematic about hunting down the issue as you.

I do wonder if this is related somehow to
https://issues.apache.org/jira/browse/ZEPPELIN-1832
which just seems to have addressed killing off zombie processes but I’m not sure it covered where zombie processes are coming from. Perhaps we need to open a ticket for this?

In the meantime, if you don’t have the ability to restart Zeppelin every time you run into this problem, you can probably just kill the interpreter process. I find myself doing that multiple times in a normal work day.
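
As a rough illustration of the workaround above, the sketch below lists candidate interpreter processes and only kills them when asked explicitly. It assumes that Zeppelin interpreter JVMs show the class name RemoteInterpreterServer on their command line, which is worth checking with `ps aux` first. The function names and the `--kill` switch are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find Zeppelin interpreter processes and optionally terminate them.
# Assumption: "RemoteInterpreterServer" appears on the interpreter command line.

# Print one PID per line for every process whose command line matches $1.
find_interpreter_pids() {
  pgrep -f "${1:-RemoteInterpreterServer}" || true
}

# Dry run by default; pass "--kill" as the second argument to send SIGTERM.
stop_interpreters() {
  local pattern="${1:-RemoteInterpreterServer}" mode="${2:-}"
  local pids
  pids="$(find_interpreter_pids "$pattern")"
  if [ -z "$pids" ]; then
    echo "no process matching '${pattern}'"
    return 0
  fi
  if [ "$mode" = "--kill" ]; then
    # Unquoted on purpose so that each PID becomes its own argument.
    kill $pids
    echo "sent SIGTERM to: $(echo $pids)"
  else
    echo "would kill: $(echo $pids)"
  fi
}
```

Running `stop_interpreters` with no arguments only reports what it would do. `stop_interpreters RemoteInterpreterServer --kill` sends SIGTERM to every match, which would affect all running interpreters and not only spark.r, so a narrower pattern may be preferable.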

Paul Brenner
DATA SCIENTIST
(217) 390-3033

On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pi...@gmail.com> wrote:

> Hi all,
> I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
> The configuration is:
>  - Ubuntu Server 16.04.2 LTS
>  - Spark 2.1.0
>  - Microsoft Open R 3.3.2
>  - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>
> zeppelin-env.sh has the following settings:
> export SPARK_HOME="/spark/home/directory"
>
> spark-env.sh has the following settings:
> export LANG="en_US"
> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>
> spark-defaults.conf is set as:
> spark.executor.memory                   21g
> spark.driver.memory                     21g
> spark.python.worker.memory              4g
> spark.sql.autoBroadcastJoinThreshold    0
>
> I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin but this is what happens:
> 1) Start zeppelin on the server using the command service zeppelin start
> 2) Connect to port 8080 using Mozilla Firefox from client
> 3) Insert username and password (I enabled Shiro authentication)
> 4) open a notebook
> 5) Execute the following code:
> %spark.r
> 2+2
> 6) The code runs correctly and I can see that R is currently running as a process.
> 7) Repeat steps 2-5 after some time (let’s say 2 or 3 hours) and Zeppelin remains forever on “Running” or, if the elapsed time is higher (for example 1 day) since the last run, it returns “Error”. The “time-to-be-unresponsive” seems to be random and unpredictable. Also, R is not present in the list of running processes. Spark session remains active because I can access Spark UI from port 4040 and the application name is “Zeppelin”, so it’s the Spark instance created by Zeppelin.
>
> I observed that sometimes I can simply restart the interpreter from Zeppelin UI, but many other times it doesn’t work and I have to restart Zeppelin ( service zeppelin restart ).
>
> This issue afflicts both 0.7.0 and 0.7.1 but I haven’t tried with previous versions. It also happens if Zeppelin isn’t installed as a service.
>
> I can’t provide more detail because I can’t see any error or warning in the logs. This is really strange.
>
> Thank you all.
> Kind regards
>  Pietro Pugni