Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/03/21 23:19:25 UTC
[jira] [Closed] (SPARK-6362) Broken pipe error when training a RandomForest on a union of two RDDs

     [ https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley closed SPARK-6362.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

I'm going to close this since it appears to be fixed (based on running it locally just now on master).

> Broken pipe error when training a RandomForest on a union of two RDDs
> ---------------------------------------------------------------------
>
>                 Key: SPARK-6362
>                 URL: https://issues.apache.org/jira/browse/SPARK-6362
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.2.0
>         Environment: Kubuntu 14.04, local driver
>            Reporter: Pavel Laskov
>            Priority: Minor
>             Fix For: 1.3.0
>
>
> Training a RandomForest classifier on a dataset obtained as a union of two RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in manager
>     code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite the error, the job runs to completion.
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
> if __name__ == "__main__":
>     sc = SparkContext(appName="Union bug test")
>     data1 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data1 = data1.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     data2 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data2 = data2.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     training_data = data1.union(data2)
>     #training_data = training_data.repartition(2)
>     model = RandomForest.trainClassifier(training_data, numClasses=2,
>                                          categoricalFeaturesInfo={},
>                                          numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies the problem (uncomment the line before training in the code above). 
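
For anyone still on an affected version, the workaround described in the report can be wrapped in a small helper. This is only a sketch: the name union_with_repartition and the default of 2 partitions are illustrative assumptions, not part of Spark or of the original report. It relies only on the RDD methods union() and repartition(), and the report does not establish why repartitioning avoids the broken pipe error.

```python
def union_with_repartition(first, second, num_partitions=2):
    """Union two RDDs, then repartition the result.

    Hypothetical helper: accepts any two objects that expose RDD-style
    union() and repartition() methods, such as PySpark RDDs.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be a positive integer")
    # union() combines the two inputs; repartition() then shuffles the
    # combined data into exactly num_partitions partitions.
    return first.union(second).repartition(num_partitions)
```

In the reproduction script above, this would replace the union line, for example training_data = union_with_repartition(data1, data2, num_partitions=2), before the call to RandomForest.trainClassifier. Note that repartition() triggers a full shuffle, so the partition count is worth tuning for larger datasets.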


