Posted to issues@spark.apache.org by "Rodrigo Boavida (Jira)" <ji...@apache.org> on 2021/10/12 11:49:00 UTC

[jira] [Created] (SPARK-36986) Improving external schema management flexibility

Rodrigo Boavida created SPARK-36986:
---------------------------------------

             Summary: Improving external schema management flexibility
                 Key: SPARK-36986
                 URL: https://issues.apache.org/jira/browse/SPARK-36986
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Rodrigo Boavida


Our Spark usage requires us to build an external schema and pass it in when creating a Dataset.

While working through this, I found a couple of optimizations that would greatly improve Spark's flexibility for handling external schema management.

1 - The ability to retrieve a field's index and definition (StructField) by name in a single call, returned as a tuple.

This means extending the StructType class to support an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * Returns None if the field is not found.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful from an efficiency perspective when we're parsing a JSON structure and, for every field, we want to check the name and field type already defined in the schema.
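
For illustration, this is a rough usage sketch of that pattern (the schema and field names below are made up, and it assumes the proposed getIndexAndFieldByName method has been added to StructType as above):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)
))

// One lookup per incoming JSON key, instead of a separate fieldIndex() call plus an apply() call:
schema.getIndexAndFieldByName("name") match {
  case Some((index, field)) =>
    println(s"'name' is column $index with type ${field.dataType}")
  case None =>
    println("'name' is not part of the externally managed schema")
}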

 

2 - Allowing a Dataset to be created from an externally defined schema together with the corresponding InternalRows, whose internal types already map to that schema. This greatly improves performance by skipping the Catalyst conversions Spark would otherwise perform.

This is what the function would look like:

 

/**
 * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
 * the caller to create the InternalRow set externally, as well as define the schema externally.
 *
 * @since 3.3.0
 */
def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
  val attributes = schema.toAttributes
  val plan = LogicalRDD(attributes, data)(self)
  val qe = sessionState.executePlan(plan)
  qe.assertAnalyzed()
  new Dataset[Row](this, plan, RowEncoder(schema))
}

 

This is similar to the existing function:

def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame

But it doesn't depend on Spark internally creating the RDD, for example by inferring it from a JSON structure, which is not useful if we're managing the schema externally.

It also skips the Catalyst conversions and the corresponding object overhead, making internal row generation much more efficient, since it is done explicitly by the caller.
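
For illustration, this is a rough usage sketch of the proposed method (the schema and data below are made up; spark is an existing SparkSession; it assumes the caller builds InternalRows using Catalyst's internal types, e.g. UTF8String for strings, and that the new createDataset(RDD[InternalRow], StructType) overload exists):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)
))

// InternalRows built directly by the caller, already in Catalyst's internal representation.
val rows = spark.sparkContext.parallelize(Seq(
  InternalRow(1L, UTF8String.fromString("alice")),
  InternalRow(2L, UTF8String.fromString("bob"))
))

// Proposed API: no Row-to-InternalRow conversion is performed by Spark.
val df = spark.createDataset(rows, schema)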

 

I will create a corresponding branch for PR review, assuming that there are no concerns with the above proposals.

 


