Posted to issues@spark.apache.org by "K (JIRA)" <ji...@apache.org> on 2016/08/24 19:27:20 UTC
[jira] [Created] (SPARK-17223) "grows beyond 64 KB" with data frame with many columns

K created SPARK-17223:
-------------------------

             Summary: "grows beyond 64 KB" with data frame with many columns
                 Key: SPARK-17223
                 URL: https://issues.apache.org/jira/browse/SPARK-17223
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 2.0.0, 2.1.0
            Reporter: K


Hi everyone, 

We have a dataset with ~500 columns. When I fit a label indexer (a StringIndexer) on it and try to print the first row of the transformed output, it fails with the "grows beyond 64 KB" error below. My original dataset had >20K rows; I cut it down to 100 rows, but that didn't help. Eventually, we want to feed the label indexer, VectorAssembler and Random Forest into a Pipeline, but we are not having much luck here :( We tried with 2.0.0 and 2.1.0 (snapshot as of 8/23). The problem is reproducible with the data file here:
https://drive.google.com/file/d/0B2zl8xCBUVh6TFZDd3ZSUTNsam8/view?usp=sharing
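
In case the file above is not reachable, here is a rough sketch for generating a synthetic stand-in with a similar shape (one "label" column plus ~500 feature columns, 100 rows). The function name write_wide_csv, the column names f0..f499, the label values and the output file name are all made up for illustration and do not come from the real data. Whether such a file triggers the same failure is an assumption and has not been verified.

```python
import csv
import random


def write_wide_csv(path, n_feature_cols=500, n_rows=100, seed=42):
    """Write a CSV with a 'label' column plus n_feature_cols numeric columns.

    Hypothetical helper: the layout only mimics the shape described in
    this report (many columns, few rows), not the actual data.
    """
    rng = random.Random(seed)
    # Made-up column names; the real file's column names are not reproduced here.
    header = ["label"] + ["f%d" % i for i in range(n_feature_cols)]
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for _ in range(n_rows):
            # Made-up string labels so that a StringIndexer has something to index.
            label = rng.choice(["a", "b", "c"])
            features = [round(rng.random(), 4) for _ in range(n_feature_cols)]
            writer.writerow([label] + features)
    return header
```

Calling write_wide_csv('synthetic_wide.csv') and loading the result in place of 'spark_bug.csv' should give a DataFrame of comparable width to run the code below against.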

Environment: Cluster with 2 nodes (CentOS, 64GB RAM and 8 cores each)

Code:
k_temp7 = load_csv_file('spark_bug.csv')

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# # Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel") \
                             .setHandleInvalid("skip") \
                             .fit(k_temp7)
            
weights = [0.70, 0.15, 0.15]
seed = 42
df_train, df_validation, df_test = k_temp7.randomSplit(weights, seed)        

#feature_assembler = VectorAssembler(inputCols=["SomeUnknownEmptyCategory"], \
#                                     outputCol="train_features") 

# # Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="train_features", \
                             predictionCol="prediction", numTrees=10)
pipeline = Pipeline(stages=[labelIndexer])  #, feature_assembler, rf])
model = pipeline.fit(df_train)

# Apply the fitted pipeline model to the training split
# (the validation split is not used in this reduced example)
model_output = model.transform(df_train)

# This fails
print(model_output.first())

ERROR:

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 250.0 failed 4 times, most recent failure: Lost task 0.3 in stage 250.0 (TID 4666, ip): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB


