Posted to issues@spark.apache.org by "Huseyin Elci (Jira)" <ji...@apache.org> on 2021/02/04 20:16:00 UTC

[jira] [Commented] (SPARK-34351) Running into "Py4JJavaError" while counting a text file or list using PySpark, Jupyter notebook

    [ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279128#comment-17279128 ] 

Huseyin Elci commented on SPARK-34351:
--------------------------------------

I searched Stack Overflow for this issue but did not find anything, and I have spent more than three days trying to solve it.
I also looked at http://spark.apache.org/community.html, which lists many "Py4JJavaError" reports, and checked several of them. Almost none of them describe the same issue, and the ones that do mention "Py4JJavaError" are about a different underlying error with no solution.
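
For reference, the workaround most often suggested for "Python worker failed to connect back" / "Accept timed out" on Windows is to point Spark explicitly at the Python interpreter before the SparkContext is created. The sketch below is only that generic suggestion, not a verified fix for this ticket, and the interpreter path is an example that has to be adjusted to the local installation.

{code:python}
import os
import findspark

# Make the driver and the Python workers use the same interpreter.
# Example path only; replace with the Python 3.6 executable actually installed.
os.environ["PYSPARK_PYTHON"] = r"C:\Python36\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python36\python.exe"

findspark.init("C:\\Spark")

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("WordCount_RDD")
         .getOrCreate())

# Re-run a small action to check whether the worker can now connect back.
print(spark.sparkContext.parallelize(range(10)).count())
{code}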

> Running into "Py4JJavaError" while counting a text file or list using PySpark, Jupyter notebook
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34351
>                 URL: https://issues.apache.org/jira/browse/SPARK-34351
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: PS> python --version
>  *Python 3.6.8*
> PS> jupyter --version
>  *jupyter core : 4.7.0*
>  *jupyter-notebook : 6.2.0*
>  qtconsole : 5.0.2
>  ipython : 7.16.1
>  ipykernel : 5.4.3
>  jupyter client : 6.1.11
>  jupyter lab : not installed
>  nbconvert : 6.0.7
>  ipywidgets : 7.6.3
>  nbformat : 5.1.2
>  traitlets : 4.3.3
> PS > java -version
>  *java version "1.8.0_271"*
>  Java(TM) SE Runtime Environment (build 1.8.0_271-b09)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)
>  
> Spark version
> *spark-2.3.1-bin-hadoop2.7*
>            Reporter: Huseyin Elci
>            Priority: Major
>
> I ran into the following error:
>  Any help resolving it would be greatly appreciated.
>  *My Code 1:*
> {code:python}
> import findspark
> findspark.init("C:\Spark")
> from pyspark.sql import SparkSession
> from pyspark.conf import SparkConf
> spark = SparkSession.builder\
>         .master("local[4]")\
>         .appName("WordCount_RDD")\
>         .getOrCreate()
> sc = spark.sparkContext
> data = "D:\\05 Spark\\data\\MyArticle.txt"
> story_rdd = sc.textFile(data)
> story_rdd.count()
> {code}
> *My Code 2:* 
> {code:python}
> import findspark
> findspark.init("C:\Spark")
> from pyspark import SparkContext
> sc = SparkContext()
> mylist = [1,2,2,3,5,48,98,62,14,55]
> mylist_rdd = sc.parallelize(mylist)
> mylist_rdd.map(lambda x: x*x)
> mylist_rdd.map(lambda x: x*x).collect()
> {code}
> *ERROR:*
> Both code samples fail with the same error.
> {code:python}
>  ---------------------------------------------------------------------------
>  Py4JJavaError Traceback (most recent call last)
>  <ipython-input-9-1af9abd2340f> in <module>
>  ----> 1 story_rdd.count()
> C:\Spark\python\pyspark\rdd.py in count(self)
>  1071 3
>  1072 """
>  -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>  1074 
>  1075 def stats(self):
> C:\Spark\python\pyspark\rdd.py in sum(self)
>  1062 6.0
>  1063 """
>  -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
>  1065 
>  1066 def count(self):
> C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op)
>  933 # zeroValue provided to each partition is unique from the one provided
>  934 # to the final reduce call
>  --> 935 vals = self.mapPartitions(func).collect()
>  936 return reduce(op, vals, zeroValue)
>  937
> C:\Spark\python\pyspark\rdd.py in collect(self)
>  832 """
>  833 with SCCallSiteSync(self.context) as css:
>  --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>  835 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
>  836
> C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
>  1255 answer = self.gateway_client.send_command(command)
>  1256 return_value = get_return_value(
>  -> 1257 answer, self.gateway_client, self.target_id, self.name)
>  1258 
>  1259 for temp_arg in temp_args:
> C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
>  ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
>  326 raise Py4JJavaError(
>  327 "An error occurred while calling {0}{1}{2}.\n".
>  --> 328 format(target_id, ".", name), value)
>  329 else:
>  330 raise Py4JError(
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
>  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148)
>  at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.net.SocketTimeoutException: Accept timed out
>  at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
>  at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
>  at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
>  at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
>  at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>  at java.net.ServerSocket.accept(ServerSocket.java:513)
>  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:142)
>  ... 12 more
> Driver stacktrace:
>  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>  at scala.Option.foreach(Option.scala:257)
>  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>  at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
>  at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
>  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148)
>  at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ... 1 more
>  Caused by: java.net.SocketTimeoutException: Accept timed out
>  at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
>  at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
>  at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
>  at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
>  at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>  at java.net.ServerSocket.accept(ServerSocket.java:513)
>  at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:142)
>  ... 12 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
