Posted to issues@spark.apache.org by "Shea Parkes (JIRA)" <ji...@apache.org> on 2016/07/16 16:34:20 UTC

[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

    [ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380846#comment-15380846 ] 

Shea Parkes commented on SPARK-12261:
-------------------------------------

I believe I'm hitting the same bug.  I'm also running on Windows in standalone cluster mode.  Unfortunately, the error is non-deterministic (i.e., I've had no luck creating a consistently reproducible scenario).

I promise I have plenty of RAM, and I'm not working with data *that* big.

I too can try to capture better logging.  As Chris hinted, though: which logs are you looking for?  I can grab logs from the master, the workers, and the application, but it isn't obvious where to find logs from the Python workers that get spawned.

I'm happy to read documentation (and code) to try to chase down the appropriate logs, but I haven't found any helpful directions yet.
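
For what it's worth, here's a minimal sketch of how I've been raising verbosity while chasing this. The master URL is a placeholder for my environment, and the worker-log location is just my understanding of standalone mode, not something I've confirmed everywhere:

from pyspark import SparkConf, SparkContext

# Placeholder master URL; substitute your own standalone master.
conf = SparkConf().setAppName("SPARK-12261-logging") \
                  .setMaster("spark://master-host:7077")
sc = SparkContext(conf=conf)

# DEBUG is very chatty, but it should capture more detail around the
# socket failure; this raises the log level on the JVM side.
sc.setLogLevel("DEBUG")

# My understanding: in standalone mode, executor stdout/stderr (where
# output from the spawned Python workers is interleaved) ends up under
# $SPARK_HOME/work/<app-id>/<executor-id>/ on each worker machine.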

> pyspark crash for large dataset
> -------------------------------
>
>                 Key: SPARK-12261
>                 URL: https://issues.apache.org/jira/browse/SPARK-12261
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2
>         Environment: windows
>            Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark. When I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in <module>
>     lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
>     port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
>     return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
>     format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code on a small text file, and this time .take() worked fine.
> How can I solve this problem?
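
For reference, the failing pattern from the quoted issue boils down to something like this (the file path is hypothetical; for us it only fails intermittently, and only with larger inputs):

from pyspark import SparkContext

sc = SparkContext(appName="SPARK-12261-repro")

# Hypothetical local file over 100 MB; small files reportedly work fine.
lines = sc.textFile("E:/spark_python/big_file.txt")

# take() runs a job and pulls the first rows back to the Python driver
# over a local socket; the "Connection reset by peer" in the traceback
# above surfaces during this call.
print(lines.take(5))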



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org