Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/06/10 10:35:01 UTC

[jira] [Created] (SPARK-2094) Ensure exactly once semantics for DDL / Commands

Michael Armbrust created SPARK-2094:
---------------------------------------

             Summary: Ensure exactly once semantics for DDL / Commands
                 Key: SPARK-2094
                 URL: https://issues.apache.org/jira/browse/SPARK-2094
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Michael Armbrust
             Fix For: 1.1.0


From [~lian cheng]...
The constraints presented here are:

 * The side effect of a command SchemaRDD should take place eagerly;
 * The side effect of a command SchemaRDD should take place once and only once;
 * When the .collect() method is called, something meaningful, usually the output message lines of the command, should be presented.

Then how about adding a lazy field inside all the physical command nodes to wrap up the side effect and hold the command output? Take the SetCommandPhysical as an example:
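{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.expressions.{Attribute, GenericRow, Row}

trait PhysicalCommand {
  // Declared as an abstract def: Scala 2 traits cannot take constructor
  // parameters, and a lazy val cannot be abstract. Implementations
  // override this with a lazy val so the side effect runs at most once.
  def commandOutput: Any
}

case class SetCommandPhysical(
    key: Option[String], value: Option[String], output: Seq[Attribute])(
    @transient context: SQLContext)
  extends PhysicalCommand {

  override lazy val commandOutput: Any = {
    // Perform the side effect, and record appropriate output
    ???
  }

  def execute(): RDD[Row] = {
    val row: Row = new GenericRow(Array[Any](commandOutput))
    // parallelize expects a Seq, so wrap the single row
    context.sparkContext.parallelize(Seq(row), 1)
  }
}
{code}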
In this way, all the constraints are met:

 * Eager evaluation: done by the toRdd call in SchemaRDDLike (PR #948),
 * Side effect should take place once and only once: ensured by the lazy commandOutput field,
 * Present meaningful output as RDD contents: command output is held by commandOutput and returned in execute().

An additional benefit is that the side-effect logic of each command can be implemented within its own physical command node, instead of adding special cases inside SQLContext.toRdd and/or HiveContext.toRdd.
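
To make the once-and-only-once argument concrete, here is a minimal sketch in plain Scala with no Spark dependency. Command, SetCommand, CommandResult, the runs counter, and the "example.key" setting are all hypothetical stand-ins invented for illustration. They are not Spark classes, and the mutable map only imitates a configuration store. The sketch assumes that an eager trigger (played here by the CommandResult constructor, in the role of the toRdd call) forces the lazy field once, after which every execute() or collect() call reuses the cached output.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a physical command node; not Spark's actual API.
trait Command {
  // Implemented as a lazy val below, so the body runs at most once.
  def commandOutput: Seq[String]
  // Stand-in for execute(): every call returns the cached output.
  def execute(): Seq[String] = commandOutput
}

// Hypothetical SET command that writes into a plain mutable map.
class SetCommand(key: String, value: String, conf: mutable.Map[String, String])
    extends Command {
  var runs = 0 // counts how often the side effect actually ran

  override lazy val commandOutput: Seq[String] = {
    conf(key) = value   // the side effect
    runs += 1
    Seq(s"$key=$value") // the output message lines
  }
}

// Hypothetical stand-in for the eager trigger (the toRdd call in the text).
class CommandResult(cmd: Command) {
  cmd.commandOutput // force the lazy val at construction time
  def collect(): Seq[String] = cmd.execute()
}

object LazyCommandSketch {
  def main(args: Array[String]): Unit = {
    val conf = mutable.Map.empty[String, String]
    val cmd = new SetCommand("example.key", "10", conf)
    val result = new CommandResult(cmd) // side effect happens here
    result.collect()
    result.collect()                    // cached; no second side effect
    println(s"runs=${cmd.runs}, conf=${conf("example.key")}")
  }
}
```

Because the side effect lives inside the lazy val, calling collect() repeatedly never re-runs it. The caller needs no special-casing to get that guarantee.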



--
This message was sent by Atlassian JIRA
(v6.2#6252)