Posted to issues@spark.apache.org by "Muhammad Kaleem Ullah (Jira)" <ji...@apache.org> on 2022/08/05 22:52:00 UTC

[jira] [Created] (SPARK-39994) How to write (save) PySpark dataframe containing vector column?

Muhammad Kaleem Ullah created SPARK-39994:
---------------------------------------------

             Summary: How to write (save) PySpark dataframe containing vector column?
                 Key: SPARK-39994
                 URL: https://issues.apache.org/jira/browse/SPARK-39994
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Muhammad Kaleem Ullah
             Fix For: 3.3.0
         Attachments: df.PNG, error.PNG

I'm trying to save a PySpark dataframe after transforming it with an ML Pipeline, but a strange error is triggered every time I write it. Here is the schema of this dataframe:

{noformat}
 |-- label: integer (nullable = true)
 |-- dest_index: double (nullable = false)
 |-- dest_fact: vector (nullable = true)
 |-- carrier_index: double (nullable = false)
 |-- carrier_fact: vector (nullable = true)
 |-- features: vector (nullable = true)
{noformat}

And the following error occurs when trying to save this dataframe that contains vector data:
{code:python}
training.write.parquet("training_files.parquet", mode="overwrite")
{code}
{noformat}
Py4JJavaError: An error occurred while calling o440.parquet.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
...
{noformat}
 

I tried different {{winutils}} builds for Hadoop from [this GitHub repository|https://github.com/cdarlint/winutils], but without much luck. Please help me in this regard: how can I save this dataframe so that I can read it back in any other Jupyter notebook? Feel free to ask any questions. Thanks
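One workaround to try, sketched below under the assumption of a working local Spark 3.x install: convert the {{vector}} columns to plain {{array<double>}} columns with {{pyspark.ml.functions.vector_to_array}} before writing. Arrays are a standard Parquet type, so the files become readable from any notebook without Spark ML's {{VectorUDT}}. The toy dataframe here is a hypothetical stand-in for {{training}}; only the {{label}} and {{features}} column names are taken from the schema above.

{code:python}
import os
import tempfile

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.master("local[1]").appName("vec-save").getOrCreate()

# Toy stand-in for the `training` dataframe from this report.
df = spark.createDataFrame(
    [(1, Vectors.dense([0.0, 1.0])), (0, Vectors.dense([1.0, 0.0]))],
    ["label", "features"],
)

# Replace the vector column with an array column; repeat for each
# vector column (dest_fact, carrier_fact, features) in the real data.
flat = df.withColumn("features", vector_to_array("features"))

# Write to a temp directory here; substitute your own output path.
path = os.path.join(tempfile.mkdtemp(), "training_files.parquet")
flat.write.parquet(path, mode="overwrite")

# Reading back needs only plain Spark SQL types, no ML classes.
back = spark.read.parquet(path)
{code}

Note this does not address the underlying {{Job aborted}} error itself, which on Windows is often a {{HADOOP_HOME}}/{{winutils.exe}} setup problem rather than anything specific to vector columns, but it does make the written files portable.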



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org