Posted to issues@spark.apache.org by "K (JIRA)" <ji...@apache.org> on 2016/08/24 19:29:20 UTC
[jira] [Updated] (SPARK-17223) "grows beyond 64 KB" with data frame with many columns
[ https://issues.apache.org/jira/browse/SPARK-17223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
K updated SPARK-17223:
----------------------
Description:
Hi everyone,
We have a dataset with ~500 columns. If I call a StringIndexer on it and try to print the first row, it fails with the "grows beyond 64 KB" error below. My original dataset had >20K rows; I stripped it down to 100 rows, but that didn't help. Eventually we want to feed StringIndexer, VectorAssembler, and Random Forest into a Pipeline, but we are not having much luck here :( We tried with 2.0.0 and 2.1.0 (snapshot as of 8/23). The problem is reproducible with the data file here:
https://drive.google.com/file/d/0B2zl8xCBUVh6TFZDd3ZSUTNsam8/view?usp=sharing
Environment: Cluster with 2 nodes (CentOS, 64GB RAM and 8 cores each)
Code is here (JIRA corrupted the formatting, so it was moved to a Google Doc):
https://docs.google.com/document/d/19unfhSMMCjoXqhmFOA1omm4V2wHaraY0RxZesbQluZU/edit?usp=sharing
ERROR:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 250.0 failed 4 times, most recent failure: Lost task 0.3 in stage 250.0 (TID 4666, ip): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
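For context: the JVM rejects any single method whose compiled bytecode exceeds 64 KB, and the failing compare() method here is Catalyst-generated row-ordering code over the wide schema. A minimal synthetic sketch along these lines (hypothetical app name and column values, not verified against the reporter's data) may reproduce the same SpecificOrdering compile failure on 2.0.0:

from pyspark.sql import Row, SparkSession

# Hypothetical app name; any existing SparkSession works.
spark = SparkSession.builder.appName("spark-17223-sketch").getOrCreate()

num_cols = 500  # roughly the width of the reporter's dataset
names = ["c{}".format(i) for i in range(num_cols)]
WideRow = Row(*names)
df = spark.createDataFrame([WideRow(*range(num_cols)) for _ in range(100)])

# Ordering by many columns makes Catalyst emit one large compare() method
# for GeneratedClass$SpecificOrdering; past the JVM's 64 KB per-method
# bytecode limit, Janino fails with the error quoted above.
df.orderBy(*names).first()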
was:
Hi everyone,
We have a dataset with ~500 columns. If I call a StringIndexer on it and try to print the first row, it fails with the "grows beyond 64 KB" error below. My original dataset had >20K rows; I stripped it down to 100 rows, but that didn't help. Eventually we want to feed StringIndexer, VectorAssembler, and Random Forest into a Pipeline, but we are not having much luck here :( We tried with 2.0.0 and 2.1.0 (snapshot as of 8/23). The problem is reproducible with the data file here:
https://drive.google.com/file/d/0B2zl8xCBUVh6TFZDd3ZSUTNsam8/view?usp=sharing
Environment: Cluster with 2 nodes (CentOS, 64GB RAM and 8 cores each)
Code:
k_temp7 = load_csv_file('spark_bug.csv')  # reporter's helper that loads the CSV into a DataFrame

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Fit on the whole dataset to include all labels in the index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel") \
    .setHandleInvalid("skip") \
    .fit(k_temp7)

weights = [0.70, 0.15, 0.15]
seed = 42
df_train, df_validation, df_test = k_temp7.randomSplit(weights, seed)

# feature_assembler = VectorAssembler(inputCols=["SomeUnknownEmptyCategory"],
#                                     outputCol="train_features")

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="train_features",
                            predictionCol="prediction", numTrees=10)

pipeline = Pipeline(stages=[labelIndexer])  # , feature_assembler, rf])
model = pipeline.fit(df_train)

# Run the fitted pipeline over the training split (the validation split is unused here).
model_output = model.transform(df_train)

# This call fails with the "grows beyond 64 KB" error below:
print(model_output.first())
ERROR:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 250.0 failed 4 times, most recent failure: Lost task 0.3 in stage 250.0 (TID 4666, ip): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
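One possible mitigation sketch (hypothetical column names; assumes the ~500 feature columns are numeric, and is not verified against this dataset): collapse the wide feature set into a single vector column up front with VectorAssembler, so that downstream stages generate code over a narrow schema instead of ~500 top-level columns. The underlying limit was later addressed in Spark itself by splitting generated ordering code into smaller methods (tracked in the related SPARK-16845).

from pyspark.ml.feature import VectorAssembler

# Hypothetical: treat every column except "label" as a numeric feature.
feature_cols = [c for c in k_temp7.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="train_features")

# Keep only the narrow result so later stages (indexing, splitting, ordering)
# see 2 columns instead of ~500.
narrow_df = assembler.transform(k_temp7).select("label", "train_features")
print(narrow_df.first())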
> "grows beyond 64 KB" with data frame with many columns
> ------------------------------------------------------
>
> Key: SPARK-17223
> URL: https://issues.apache.org/jira/browse/SPARK-17223
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 2.0.0, 2.1.0
> Reporter: K
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)