Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/03/21 23:19:25 UTC
[jira] [Closed] (SPARK-6362) Broken pipe error when training a RandomForest on a union of two RDDs
[ https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley closed SPARK-6362.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0
I'm going to close this since it appears to be fixed (based on running it locally just now on master).
> Broken pipe error when training a RandomForest on a union of two RDDs
> ---------------------------------------------------------------------
>
> Key: SPARK-6362
> URL: https://issues.apache.org/jira/browse/SPARK-6362
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, local driver
> Reporter: Pavel Laskov
> Priority: Minor
> Fix For: 1.3.0
>
>
> Training a RandomForest classifier on a dataset obtained as a union of two RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in manager
>     code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite the error, the job runs to completion.
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
> if __name__ == "__main__":
>     sc = SparkContext(appName="Union bug test")
>     data1 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data1 = data1.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     data2 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data2 = data2.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     training_data = data1.union(data2)
>     #training_data = training_data.repartition(2)
>     model = RandomForest.trainClassifier(training_data, numClasses=2,
>                                          categoricalFeaturesInfo={},
>                                          numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies the problem (uncomment the line before training in the code above).
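For readers unfamiliar with the error in the traceback above: errno 32 (EPIPE) is what the operating system reports when a process writes to a pipe or socket whose other end has already been closed. The sketch below is not taken from the issue and is not a diagnosis of the Spark behavior; it only reproduces the same errno with the Python standard library on a POSIX system. The function name write_to_closed_pipe is hypothetical.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe whose read end is already closed; return the errno raised."""
    # Python normally ignores SIGPIPE at startup; set it explicitly so the
    # failed write surfaces as an exception instead of killing the process.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reading side goes away first

    try:
        os.write(write_fd, b"result")
    except OSError as exc:  # BrokenPipeError on Python 3
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    # Compare against the symbolic constant rather than a hard-coded number.
    print(write_to_closed_pipe() == errno.EPIPE)
```

In the traceback, the failing call is outfile.flush() in daemon.py, which is consistent with the worker flushing to a connection that had already been closed on the other side. Why that happens only after a union is not established in this thread.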