
Posted to user@spark.apache.org by CCInCharge <ch...@gmail.com> on 2018/01/27 23:17:21 UTC

Custom Catalyst Optimizer Strategy for DataFrame Writes?

I've been working with DataStax's spark-cassandra-connector and have noticed
that, when it creates batches of DataFrame Rows to write to the database,
write throughput increases substantially and overall task completion time
decreases if the user sorts the DataFrame on the Cassandra partition key
before writing.
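
To illustrate why I think the sort helps, here is a toy model in plain Python. It is only a sketch under an assumption of mine, not the connector's actual batching code: it assumes that only consecutive rows sharing a partition key can go into the same batch, and the function name, the max_batch_size parameter, and the sample keys are all made up for the example.

```python
from itertools import groupby


def count_batches(partition_keys, max_batch_size=3):
    """Count batches under the toy assumption that only consecutive rows
    with the same partition key can share a batch."""
    batches = 0
    for _, run in groupby(partition_keys):
        run_length = sum(1 for _ in run)
        # Ceiling division: a run longer than max_batch_size is split up.
        batches += -(-run_length // max_batch_size)
    return batches


# Hypothetical partition keys in the order the rows happen to arrive.
keys = ["a", "b", "a", "c", "b", "a", "c", "b", "a"]

unsorted_batches = count_batches(keys)
sorted_batches = count_batches(sorted(keys))
print(unsorted_batches, sorted_batches)  # prints "9 4"
```

In this model the unsorted rows never form a run longer than one row, so every row becomes its own batch, while the sorted rows collapse into a few larger batches.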

Saving a DataFrame from Spark to Cassandra with the connector is done by
calling the DataFrame API's write method and setting the output format to
"org.apache.spark.sql.cassandra", which makes the DataFrameWriter write the
data to Cassandra through the connector.
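
For concreteness, this is roughly the manual version of what I do today, sketched in PySpark. The helper name and its parameters are hypothetical, and I'm assuming df is a pyspark.sql.DataFrame and that the connector is already on the classpath and configured.

```python
def write_sorted_to_cassandra(df, keyspace, table, partition_key_cols):
    """Sort a DataFrame on the Cassandra partition key columns, then
    append it to the given table through the connector.

    `df` is assumed to be a pyspark.sql.DataFrame; the keyspace, table,
    and column names are supplied by the caller.
    """
    (
        df.sort(*partition_key_cols)  # the manual sort I'd like to automate
        .write.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

A call would look something like write_sorted_to_cassandra(df, "my_keyspace", "my_table", ["user_id"]), where the keyspace, table, and column are placeholders.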

I'm thinking that the spark-cassandra-connector could automatically sort a
DataFrame by Cassandra partition key before it writes the data to the
database. I am not very familiar with Catalyst, but one possibility I had in
mind is to add a custom Catalyst rule (via extraStrategies or
extraOptimizations) in the connector that does this automatically. Is this
possible/valid, or am I misunderstanding what can be done with custom
Catalyst optimizations?


