Posted to issues@spark.apache.org by "brian wang (JIRA)" <ji...@apache.org> on 2018/04/26 06:01:00 UTC

[jira] [Created] (SPARK-24095) Spark Streaming performance drastically drops when saving dataframes with withColumn

brian wang created SPARK-24095:
----------------------------------

             Summary: Spark Streaming performance drastically drops when saving dataframes with withColumn
                 Key: SPARK-24095
                 URL: https://issues.apache.org/jira/browse/SPARK-24095
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: brian wang


We have a Spark Streaming application that streams data from Kafka and writes it to HDFS after a series of transformations. We use Spark SQL for the transformations and store the data in HDFS at two stages. The ingestion to Spark that we do at the second stage drastically reduces the application's performance.
The incoming data contains close to 40 million transactions per hour. We have observed a performance bottleneck in the write to HDFS.
Can you please help us optimize the application's performance?
This is a critical issue: it is holding up our deployment to the production cluster, and we are running behind schedule on the production deployment.
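
For scale, the 40 million transactions per hour quoted above can be converted into a per-second and per-batch load. This is only a rough sketch: the batch interval is not stated in this report, so BATCH_INTERVAL_SECONDS below is an assumed placeholder, and the function names are illustrative.

```python
# Back-of-the-envelope load estimate for the figures in this report.
TRANSACTIONS_PER_HOUR = 40_000_000  # "close to 40 million" per the report
BATCH_INTERVAL_SECONDS = 10         # ASSUMPTION: not given in the report


def records_per_second(per_hour):
    """Average arrival rate, assuming the load is spread evenly over the hour."""
    return per_hour / 3600.0


def records_per_batch(per_hour, batch_interval_seconds):
    """Average number of records each batch must process and write."""
    return records_per_second(per_hour) * batch_interval_seconds


if __name__ == "__main__":
    rate = records_per_second(TRANSACTIONS_PER_HOUR)
    batch = records_per_batch(TRANSACTIONS_PER_HOUR, BATCH_INTERVAL_SECONDS)
    print("records/second: %.0f" % rate)
    print("records/batch:  %.0f" % batch)
```

Under these assumptions, each save has to finish within the batch interval on average for the application to keep up with the incoming data.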

 

Answer: First Stage Save

test_Transformed_DOW.cache().withColumn("test_class_map", udf(test_class_map, StringType())(array(test_class))).write.mode("append").option("header","true").csv("/hive/warehouse/test")

Second Stage Save

test_Data_Final=spark.sql("select test1,test2,test3...... when int(seats)>=2 then 1 when int(seats) < 2 then 0 end as seats from test_Data_Unpivoted").write.format("parquet").mode("append").saveAsTable("test_Data_Output")

It is the first save stage that slows down our Spark application when it is enabled. If we disable it, the application seems to keep up with the incoming data flow.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
