Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2014/08/04 05:14:49 UTC

[GitHub] spark pull request: [SPARK-2816][SQL] Type-safe SQL Queries

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/1759

    [SPARK-2816][SQL] Type-safe SQL Queries

    **This is an experimental feature of Spark SQL and is intended primarily to get feedback from users.  APIs may change in future versions.**
    
    This PR adds a string interpolator that allows users to run Spark SQL queries that return type-safe
    results in Scala. SQL interpolation is invoked by prefixing a string literal with `sql`, and supports including RDDs using `$`. For example:
    
    ```scala
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    
    case class Person(firstName: String, lastName: String, age: Int)
    val people = sc.makeRDD(Person("Michael", "Armbrust", 30) :: Nil)
    
    val michaels = sql"SELECT * FROM $people WHERE firstName = 'Michael'"
    ```
    
    The result RDDs of interpolated SQL queries contain [Scala records](https://github.com/scala-records/scala-records) that have been *refined* with the output schema of the query.  This refinement means that you can access the columns of the result as you would normal fields of objects in Scala, and that these fields will return the correct type.  Continuing the previous example:
    ```scala
    assert(michaels.first().firstName == "Michael")
    ```
    
    You can also use interpolation to include lambda functions that are in scope as UDFs.
    
    ```scala
    import java.util.Calendar
    val birthYear = (age: Int) => Calendar.getInstance().get(Calendar.YEAR) - age
    val years = sql"SELECT $birthYear(age) FROM $people"
    ```
    
    Results can also be refined into existing case class types when the names of the columns match up with the arguments to the class's constructor.
    ```scala
    case class Employee(name: String, birthYear: Int)
    val employees: RDD[Employee] =
      sql"SELECT lastName AS name, $birthYear(age) AS birthYear FROM $people".map(_.to[Employee])
    ```
    
    Known limitations:
     - SQL interpolation will only work when the included RDDs contain case classes and the type of the case class can be determined statically at compile time.
     - Null values for primitive columns will raise an exception.
     - Escapes in strings may not be handled correctly.
     - Does not work with `"""` (triple-quoted) strings or embedded newlines.
    
    Thanks to @gzm0 @vjovanov @hubertp @densh for Scala records and @ahirreddy for the initial work on the interpolator.
    
    TODO:
     - [ ] Maven build

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark typedSql

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1759.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1759
    
----
commit 2fd5a85d1af4566cb1f9505c2a722aecc8468a11
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-07T05:40:35Z

    WIP: Typed SQL queries

commit 457d699e6f8d16c07e743e5e35a37dbe0e24f30d
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-07T07:20:06Z

    Now with more than one relation.

commit c6c60e38cd7bb8b8878bf1e010c910e88bb372c5
Author: Tobias Schlatter <to...@meisch.ch>
Date:   2014-07-08T11:41:57Z

    Remove intermediate map for records. Allow serialization

commit 3d4ce6729dbae4c9206ee225792306be25531f0c
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-11T23:00:24Z

    Merge pull request #4 from gzm0/typedSql
    
    Remove intermediate map for records. Allow serialization

commit 157d242465cfa945842c7e268f96d250e28962fa
Author: Tobias Schlatter <to...@meisch.ch>
Date:   2014-07-17T11:37:42Z

    Add specialization to record implementation

commit 24f8d1690990e45d1add37febe4c0ab661075b46
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-22T01:19:32Z

    Merge pull request #5 from gzm0/typedSql
    
    Add specialization to record implementation

commit ac067cb4f7577f9fed145a42974b1e2e6e51d14d
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-22T01:39:26Z

    Add nested test case

commit 83dd0928db6d1109a9290dd14e7208c90ee75c60
Author: Tobias Schlatter <to...@meisch.ch>
Date:   2014-07-22T12:19:59Z

    Fix records version to 0.1

commit ae5ecaf56fe2dab90327635dcc58e59ab236bb4d
Author: Tobias Schlatter <to...@meisch.ch>
Date:   2014-07-22T12:20:53Z

    Handle nested fields

commit b38fef3b7520d68668c235922ed229d2e0a5b20f
Author: Tobias Schlatter <to...@meisch.ch>
Date:   2014-07-22T12:31:08Z

    Refactor ScalaReflection to support compile-time reflection

commit 49be122631468b858b5ba131c0b3c5fc51c05db3
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-07-24T02:09:42Z

    Merge pull request #7 from gzm0/refactor-scala-reflection
    
    Refactor scala reflection

commit e4f8c49eaa52142bcbd9158253cd30f55b04f323
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-03T22:44:35Z

    Merge remote-tracking branch 'origin/master' into typedSql
    
    Conflicts:
    	project/SparkBuild.scala
    	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

commit 4d62fb58fac6bafd1ee65bea62d843b6f2106350
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-03T23:28:26Z

    Merge remote-tracking branch 'marmbrus/typedSql' into typedSql
    
    Conflicts:
    	project/SparkBuild.scala

commit d64c860df8e2fe8b7d14190ebd160c2c1d312c88
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T01:24:10Z

    Add udf support.

commit 5b3ab551c9eb804971b24f9291504f1d5703223c
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T01:38:49Z

    Docs and private.

commit ce7dd36f1fcbde43bd22c5fd12006059d32ae5f0
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T01:40:53Z

    Merge remote-tracking branch 'origin/master' into typedSql
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

commit 760466a4d515fb2cec37e47f0943c32a5d274c7a
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T01:41:24Z

    spurious change

commit 2b73b47653f350ca9911243f8b23c1218512660e
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T02:52:45Z

    some quote handling, case sensitivity

commit ca471036a48a6146bbf276d58714db9e511ddfb1
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T02:53:58Z

    printlns

commit 6e33920c4c046a80195ee168f0d0d731fc8306af
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-04T03:06:21Z

    formatting

----




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51741900
  
    We could add methods to records that let you specify the target type:
    
    ```x.name[String]```
    
    It's unclear, though, how this could be converted into case classes automatically, for example.
    
    Another option would be to query the DB at compile time (through an `EXPLAIN`). It is a requirement anyway to have a DB aligned with the code, so making a DB available at compile time is just a matter of configuration.
    
    On the other hand, in terms of "determinism" of what the compiler yields, this is sub-optimal.
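
    For example, a stripped-down (and unchecked) sketch of such an accessor could look like the following; the names are hypothetical, and the real thing would presumably be macro-generated and validated against the query schema rather than relying on a cast:
    
    ```scala
    // Purely illustrative, not the scala-records or Spark SQL API.
    class DynRecord(fields: Map[String, Any]) {
      def get[T](name: String): T = fields(name).asInstanceOf[T]
    }
    
    val row  = new DynRecord(Map("firstName" -> "Michael", "age" -> 30))
    val name = row.get[String]("firstName") // the caller chooses the target type
    val age  = row.get[Int]("age")
    ```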




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55517473
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20303/consoleFull) for   PR 1759 at commit [`5d6136c`](https://github.com/apache/spark/commit/5d6136cf92652d621b44bda5db437be01c55a368).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `





[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55512891
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20286/consoleFull) for   PR 1759 at commit [`f170b0f`](https://github.com/apache/spark/commit/f170b0f5e429a4fba8f05bb65d37d0ed5039a9ab).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15771735
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    --- End diff --
    
    Yeah, this is a good question about our interfaces.  I see a couple of ways we could handle this:
     - Have a separate isNull method and make calling the primitive accessor invalid when that method returns true.  If a user fails to check, we throw an exception.
     - Box everything
     - Use nullability information as follows:  Return `Option[Type]` when the attribute is nullable, return the primitive when it is not nullable.  Right now we don't do a great job in the optimizer of propagating nullability information, but over time this should get better.  That way we could avoid the cost of `Option` on any attribute that was involved in a predicate that would prevent it from being null.
    
    Personally I like the last option the best.  It makes it very explicit to users when things could be null, and still gives them a way to get high-performance access when a primitive value cannot be null.  It does, however, introduce some possible confusion (e.g. changing the query in subtle ways, such as adding a predicate, could change the return types). This approach also requires the most work to be done improving Catalyst's analysis.
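
    As a rough sketch of how the third option could feel at the use site (the record types below are illustrative stand-ins, not something the interpolator generates today):
    
    ```scala
    // Hypothetical shapes for the same `age` column under option three.
    case class NullableRow(age: Option[Int]) // column not provably non-null
    case class StrictRow(age: Int)           // column proven non-null by analysis
    
    val fromOuterJoin  = NullableRow(None)
    val fromInnerQuery = StrictRow(30)
    
    // The nullable version forces explicit handling ...
    val birthYear1 = fromOuterJoin.age.map(2014 - _).getOrElse(-1)
    // ... while the non-null version keeps direct primitive access.
    val birthYear2 = 2014 - fromInnerQuery.age
    ```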




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51810035
  
    I was planning to query the database or file at compile time for these sorts of data sources.  While you are right that this is less `deterministic`, it's not clear to me that it is desirable to have a compiler that deterministically allows you to write programs that don't line up with the schema of the data.  If the schema changes and my program is now invalid, I want the compilation to fail!
    
    Another note: this is not intended as the only interface to Spark SQL, and I think we should plan to support the less magical interface long term for cases where determining the schema at compile time is not feasible.
    
    Finally, I think the time when this functionality will be the most useful is in the interactive Spark Shell.  In these cases you want the code to be as concise as possible, and the line between "compilation" and "execution" is pretty blurry already.




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55512895
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20286/consoleFull) for   PR 1759 at commit [`f170b0f`](https://github.com/apache/spark/commit/f170b0f5e429a4fba8f05bb65d37d0ed5039a9ab).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RatingDeserializer(FramedSerializer):`
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder extends compression.Encoder[IntegerType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])`
      * `  class Encoder extends compression.Encoder[LongType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])`





[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15790708
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    --- End diff --
    
    Your point about changing return type based on complex static analysis is well taken and that is my hesitation as well.  That said...
    
    My thought was that you could do something similar to a type ascription by adding a `WHERE a IS NOT NULL` or similar to the query whenever you don't want to deal with the option type.  This forces the programmer to explicitly denote how null values are handled (by filtering them out).
    
    Regarding joins, for inner joins you won't change the nullability of any output attributes, so it'll still relate to the database schema.  For outer joins I think we *should* be forcing the programmer to explicitly deal with the fact that they are introducing nullability through their choice of join.
    
    Another possibility here would be to have two interpolators, one with boxing costs but simple semantics and one with explicit Options or primitives based on the SQL analysis.
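
    To illustrate the outer-join point with plain Scala collections (the types here are made up for the example, not part of the PR):
    
    ```scala
    case class Dept(id: Int, name: String)
    case class Emp(name: String, deptId: Int)
    
    val depts = Seq(Dept(1, "eng"))
    val emps  = Seq(Emp("Michael", 1), Emp("Sam", 2))
    
    // A left outer join written by hand: the right side is naturally an Option,
    // because the join itself introduces the possibility of a missing row.
    val joined: Seq[(Emp, Option[Dept])] =
      emps.map(e => (e, depts.find(_.id == e.deptId)))
    // Sam has no matching Dept, so the element type has to acknowledge the None.
    ```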




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15785504
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    --- End diff --
    
    After having given this some thought (I first liked the 3rd approach best), I'd now like to advocate for the 1st:
    
    Mapping a nullable type to an option makes sense if it comes directly from a database layout. However, most of the nullability in the use cases for SQL will probably come from joins and is therefore potentially ruled out by further conditions.
    
    IMHO changing the return type of a query based on complex static analysis of the SQL query is a *very* bad idea, especially since these types can't be ascribed.
    
    Therefore, it seems better to leave it up to the user to check this (and consider it a limitation of SQL's type system) and provide an `isNull` and a `toOption` method.
    
    It is not entirely clear to me how this could best be implemented. A value class with these members and an implicit conversion to its contained type might be a possibility.
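
    A minimal sketch of that idea (all names are hypothetical, and the real version would presumably be generated by the record macro rather than written by hand):
    
    ```scala
    import scala.language.implicitConversions
    
    // Allocation-free wrapper exposing isNull/toOption on a possibly-null value.
    class Nullable[T](val underlying: T) extends AnyVal {
      def isNull: Boolean = underlying == null
      def toOption: Option[T] = Option(underlying)
    }
    
    object Nullable {
      // Implicit unwrap so ordinary usage stays terse (and can still NPE).
      implicit def unwrap[T](n: Nullable[T]): T = n.underlying
    }
    
    val name    = new Nullable("Michael")
    val missing = new Nullable[String](null)
    
    val upper: String         = name.toUpperCase // implicit unwrap at the call site
    val safe:  Option[String] = missing.toOption // explicit null handling when wanted
    ```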
    
    WDYT?




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-53685952
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19387/consoleFull) for   PR 1759 at commit [`500e746`](https://github.com/apache/spark/commit/500e746014b1a6c0406df7013a2febeecd858648).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15770570
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    --- End diff --
    
    Is there a clean way to do logging from macro code that isn't printlns?
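
    One candidate might be the reporter hooks on the macro `Context` itself; a hedged sketch (not something this PR currently does):
    
    ```scala
    import scala.language.experimental.macros
    import scala.reflect.macros.Context
    
    object MacroLogSketch {
      def traced(msg: String): Unit = macro tracedImpl
    
      def tracedImpl(c: Context)(msg: c.Expr[String]): c.Expr[Unit] = {
        import c.universe._
        // Routed through the compiler's reporter instead of println; the third
        // argument ("force") set to false means it only shows under -verbose,
        // which is closer to debug-level logging.
        c.info(c.enclosingPosition, "expanding traced(...)", false)
        c.warning(c.enclosingPosition, "example warning emitted from a macro")
        reify { println(msg.splice) }
      }
    }
    ```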




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus closed the pull request at:

    https://github.com/apache/spark/pull/1759




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-56121439
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20558/consoleFull) for   PR 1759 at commit [`677fa3d`](https://github.com/apache/spark/commit/677fa3d1cfffa7129cd5c8ab01b8918234cf4ce5).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `





[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15770539
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    --- End diff --
    
    Lingering debug code




[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55953598
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20475/consoleFull) for   PR 1759 at commit [`58c33a8`](https://github.com/apache/spark/commit/58c33a810ce4b64dfe4b1bc5076003bd306de110).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `





[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15770817
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    --- End diff --
    
    IIUC, a null value in a primitive field will cause an NPE. We might want to use the Java types for all the primitive fields and rely on implicit conversion at the use site. This way, an NPE can be avoided at the use site. It comes at the cost of boxing everything, though.
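
    For illustration, a tiny self-contained example of the trade-off being described (not the PR's current behaviour):
    
    ```scala
    // Boxed field: the null is carried along harmlessly until the value is
    // actually needed as a primitive.
    case class BoxedRow(age: java.lang.Integer)
    
    val row = BoxedRow(null)
    val carried = row.age                  // fine, just a null reference
    
    val unboxed: Int =
      try (row.age: Int)                   // Predef.Integer2int unboxes and NPEs on null
      catch { case _: NullPointerException => -1 }
    ```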




[GitHub] spark pull request: [SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15770530
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    --- End diff --
    
    Lingering debug code


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51730914
  
    @marmbrus how do you intend this to work with things like Hive or JDBC? We won't know the types at compile time there, but we might still want a solution that checks the field names. I think we should have a design for that in mind or else this will be of somewhat limited use.


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-53685965
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19387/consoleFull) for   PR 1759 at commit [`500e746`](https://github.com/apache/spark/commit/500e746014b1a6c0406df7013a2febeecd858648).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `



[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55941419
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20475/consoleFull) for   PR 1759 at commit [`58c33a8`](https://github.com/apache/spark/commit/58c33a810ce4b64dfe4b1bc5076003bd306de110).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15770496
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    +    def genRecord(row: Tree, fields: Seq[(String, DataType)]) = {
    +      case class ImplSchema(name: String, tpe: Type, impl: Tree)
    +
    +      val implSchemas = for {
    +        ((name, dataType),i) <- fields.zipWithIndex
    +      } yield {
    +        val tpe = c.typeCheck(genGetField(q"null: $rowTpe", i, dataType)).tpe
    +        val tree = genGetField(row, i, dataType)
    +
    +        ImplSchema(name, tpe, tree)
    +      }
    +
    +      val schema = implSchemas.map(f => (f.name, f.tpe))
    +
    +      val (spFlds, objFields) = implSchemas.partition(s =>
    +        rMacros.specializedTypes.contains(s.tpe))
    +
    +      val spImplsByTpe = {
    +        val grouped = spFlds.groupBy(_.tpe)
    +        grouped.mapValues { _.map(s => s.name -> s.impl).toMap }
    +      }
    +
    +      val dataObjImpl = {
    +        val impls = objFields.map(s => s.name -> s.impl).toMap
    +        val lookupTree = rMacros.genLookup(q"fieldName", impls, mayCache = false)
    +        q"($lookupTree).asInstanceOf[T]"
    +      }
    +
    +      rMacros.specializedRecord(schema)(tq"Serializable")()(dataObjImpl) {
    +        case tpe if spImplsByTpe.contains(tpe) =>
    +          rMacros.genLookup(q"fieldName", spImplsByTpe(tpe), mayCache = false)
    +      }
    +    }
    +
    +    /**
    +     * Generate a tree that retrieves a given field for a given type.
    +     * Constructs a nested record if necessary
    +     */
    +    def genGetField(row: Tree, index: Int, t: DataType): Tree = t match {
    +      case t: PrimitiveType =>
    +        val methodName = newTermName("get" + primitiveForType(t))
    +        q"$row.$methodName($index)"
    +      case StructType(structFields) =>
    +        val fields = structFields.map(f => (f.name, f.dataType))
    +        genRecord(q"$row($index).asInstanceOf[$rowTpe]", fields)
    +      case _ =>
    --- End diff --
    
    Consider adding arrays here
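
    For concreteness, a sketch of how an array case might slot into `genGetField` (illustrative only, not code from this PR; it assumes Catalyst's `ArrayType(elementType, containsNull)` and that array columns come back as `Seq` values on the `Row`):
    
    ```scala
    case ArrayType(elementType, _) =>
      elementType match {
        case StructType(structFields) =>
          // Rebuild a nested record for every element of the array.
          val fields = structFields.map(f => (f.name, f.dataType))
          val elemRecord = genRecord(q"elem", fields)
          q"$row($index).asInstanceOf[Seq[$rowTpe]].map(elem => $elemRecord)"
        case _ =>
          // Leave primitive-element arrays as an untyped sequence for now.
          q"$row($index).asInstanceOf[Seq[Any]]"
      }
    ```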


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55517240
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20301/consoleFull) for   PR 1759 at commit [`bc8e64b`](https://github.com/apache/spark/commit/bc8e64bd6d8ea1eda78ec1901eff72f3fcb54b75).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51812566
  
    I see. I think querying Hive at compile time will be tricky (since it may be behind all kinds of firewalls, etc.). But I guess we can start with the current approach, as long as it's marked as highly experimental.


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by vjovanov <gi...@git.apache.org>.
Github user vjovanov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15804941
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    --- End diff --
    
    You can look at what I did in Yin-Yang; it works quite well for us. Here are the links:
    
    Method:
    https://github.com/vjovanov/yin-yang/blob/master/components/core/src/Utils.scala#L59
    Config: 
    https://github.com/vjovanov/yin-yang/blob/master/components/yin-yang/src/YYTransformer.scala#L16
    Usage:
    https://github.com/vjovanov/yin-yang/blob/master/components/yin-yang/src/YYTransformer.scala#L72


[GitHub] spark pull request: [SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51014217
  
    QA tests have started for PR 1759. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17838/consoleFull


[GitHub] spark pull request: [SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-51014222
  
    QA results for PR 1759:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      - trait ScalaReflection {
      - case class Schema(dataType: DataType, nullable: Boolean)
      - class Macros[C <: Context](val c: C) extends ScalaReflection {
      - trait InterpolatedItem {
      - case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
      - case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
      - case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
      - case class ImplSchema(name: String, tpe: Type, impl: Tree)
      - trait TypedSQL {
      - implicit class SQLInterpolation(val strCtx: StringContext) {
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17838/consoleFull


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15784752
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    --- End diff --
    
    What do you mean by logging? Do you mean some kind of compile-time log? Not that I've heard of one.
    
    Macro debugging is still an open issue; the usual techniques include:
    
    - println
    - the -Xprint:typer flag for scalac
    - showTree in the REPL
    
    But of course none of these should remain in production code, IMHO.
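
    A minimal sketch of the ad-hoc printing described above (illustrative only, not code from this PR; the `spark.sql.macro.debug` property name is hypothetical):
    
    ```scala
    // Gate the debug output behind a system property so it can be switched on
    // without editing the macro; showRaw comes from the macro universe that is
    // already imported inside Macros.
    val debug = sys.props.get("spark.sql.macro.debug").exists(_.toBoolean)
    if (debug) {
      println(showRaw(tree))
    }
    // For deeper inspection, compiling with `scalac -Xprint:typer` shows the
    // expanded, typed trees.
    ```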


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-63022692
  
    Moved here: https://github.com/marmbrus/sql-typed


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by gzm0 <gi...@git.apache.org>.
Github user gzm0 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1759#discussion_r15922943
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/TypedSql.scala ---
    @@ -0,0 +1,202 @@
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.catalyst.analysis._
    +import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUdf, AttributeReference}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.catalyst.types._
    +
    +import scala.language.experimental.macros
    +import scala.language.existentials
    +
    +import records._
    +import Macros.RecordMacros
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.{SqlParser, ScalaReflection}
    +
    +/**
    + * A collection of Scala macros for working with SQL in a type-safe way.
    + */
    +private[sql] object SQLMacros {
    +  import scala.reflect.macros._
    +
    +  def sqlImpl(c: Context)(args: c.Expr[Any]*) =
    +    new Macros[c.type](c).sql(args)
    +
    +  case class Schema(dataType: DataType, nullable: Boolean)
    +
    +  class Macros[C <: Context](val c: C) extends ScalaReflection {
    +    val universe: c.universe.type = c.universe
    +
    +    import c.universe._
    +
    +    val rowTpe = tq"_root_.org.apache.spark.sql.catalyst.expressions.Row"
    +
    +    val rMacros = new RecordMacros[c.type](c)
    +
    +    trait InterpolatedItem {
    +      def placeholderName: String
    +      def registerCode: Tree
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry)
    +    }
    +
    +    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"func$index"
    +
    +      def registerCode = q"""registerFunction($placeholderName, $expr)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        registry.registerFunction(
    +          placeholderName, (_: Seq[Expression]) => ScalaUdf(null, returnType, Nil))
    +      }
    +    }
    +
    +    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)
    +      extends InterpolatedItem{
    +
    +      val placeholderName = s"table$index"
    +
    +      def registerCode = q"""$expr.registerTempTable($placeholderName)"""
    +
    +      def localRegister(catalog: Catalog, registry: FunctionRegistry) = {
    +        catalog.registerTable(None, placeholderName, LocalRelation(schema.toAttributes :_*))
    +      }
    +    }
    +
    +    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)
    +
    +    def sql(args: Seq[c.Expr[Any]]) = {
    +
    +      val q"""
    +        $interpName(
    +          scala.StringContext.apply(..$rawParts))""" = c.prefix.tree
    +
    +      //rawParts.map(_.toString).foreach(println)
    +
    +      val parts =
    +        rawParts.map(
    +          _.toString.stripPrefix("\"")
    +           .replaceAll("\\\\", "")
    +           .stripSuffix("\""))
    +
    +      val interpolatedArguments = args.zipWithIndex.map { case (arg, i) =>
    +        // println(arg + " " + arg.actualType)
    +        arg.actualType match {
    +          case TypeRef(_, _, Seq(schemaType)) =>
    +            InterpolatedTable(i, arg, schemaFor(schemaType).dataType.asInstanceOf[StructType])
    +          case TypeRef(_, _, Seq(inputType, outputType)) =>
    +            InterpolatedUDF(i, arg, schemaFor(outputType).dataType)
    +        }
    +      }
    +
    +      val query = parts(0) + args.indices.map { i =>
    +        interpolatedArguments(i).placeholderName + parts(i + 1)
    +      }.mkString("")
    +
    +      val parser = new SqlParser()
    +      val logicalPlan = parser(query)
    +      val catalog = new SimpleCatalog(true)
    +      val functionRegistry = new SimpleFunctionRegistry
    +      val analyzer = new Analyzer(catalog, functionRegistry, true)
    +
    +      interpolatedArguments.foreach(_.localRegister(catalog, functionRegistry))
    +      val analyzedPlan = analyzer(logicalPlan)
    +
    +      val fields = analyzedPlan.output.map(attr => (attr.name, attr.dataType))
    +      val record = genRecord(q"row", fields)
    +
    +      val tree = q"""
    +        ..${interpolatedArguments.map(_.registerCode)}
    +        val result = sql($query)
    +        result.map(row => $record)
    +      """
    +
    +      // println(tree)
    +      c.Expr(tree)
    +    }
    +
    +    // TODO: Handle nullable fields
    --- End diff --
    
    I was talking more about things like the following (all schema values non-nullable):
    
    ```sql
    SELECT p1.name, AVG(p2.height) as average
    FROM players AS p1 LEFT JOIN players AS p2 ON p2.score > p1.score
    GROUP BY p1.id, p1.name
    HAVING average - p1.height > 0 -- implicitly filters average for null
    ```
    
    versus
    
    ```sql
    SELECT p1.name, AVG(p2.height) as average
    FROM players AS p1 INNER JOIN players AS p2 ON p2.score > p1.score
    GROUP BY p1.id, p1.name
    HAVING average - p1.height > 0 -- average is NOT NULL since we inner join
    ```
    
    What should the return type of `average` be in the first example? Would we require the programmer to explicitly add `average IS NOT NULL`?
    
    I don't have a more blatant example at hand, but I'm concerned that these kinds of things could lead to tons of "useless" `IS NOT NULL` clauses...
    
    On the other hand, I see that not using boxing (through `Option`) in the front end only pushes the problem further down to the backend, since it presumably needs nullability to make storage decisions. (However, this is transparent to the user, so it can be changed easily.)
    
    Let me know what you think.
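
    Purely illustrative (not code from this PR, column positions hypothetical): how the two choices would read at the usage site, assuming a Catalyst `Row` with a possibly-NULL `average` in column 0 and a non-null `height` in column 1:
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.Row
    
    // Option-based surface: the NULL case must be handled explicitly.
    def havingDiffOpt(row: Row): Option[Double] = {
      val averageOpt =
        if (row.isNullAt(0)) None else Some(row.getDouble(0))
      averageOpt.map(_ - row.getDouble(1))
    }
    
    // Boxed surface (java.lang.Double): reads like plain arithmetic, but throws
    // an NPE on unboxing when `average` is NULL, i.e. only at the usage site.
    def havingDiffBoxed(row: Row): Double = {
      val average: java.lang.Double =
        if (row.isNullAt(0)) null else Double.box(row.getDouble(0))
      average - row.getDouble(1) // implicit unboxing happens here
    }
    ```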


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55517421
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20303/consoleFull) for   PR 1759 at commit [`5d6136c`](https://github.com/apache/spark/commit/5d6136cf92652d621b44bda5db437be01c55a368).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-55517245
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20301/consoleFull) for   PR 1759 at commit [`bc8e64b`](https://github.com/apache/spark/commit/bc8e64bd6d8ea1eda78ec1901eff72f3fcb54b75).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `



[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-54229712
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19609/consoleFull) for   PR 1759 at commit [`ad3b87f`](https://github.com/apache/spark/commit/ad3b87f8b8eadab0524875efab40da8f5c2665ae).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ScalaReflection `
      * `  case class Schema(dataType: DataType, nullable: Boolean)`
      * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
      * `    trait InterpolatedItem `
      * `    case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)`
      * `    case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)`
      * `    case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)`
      * `      case class ImplSchema(name: String, tpe: Type, impl: Tree)`
      * `trait TypedSQL `
      * `  implicit class SQLInterpolation(val strCtx: StringContext) `



[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-54229697
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19609/consoleFull) for   PR 1759 at commit [`ad3b87f`](https://github.com/apache/spark/commit/ad3b87f8b8eadab0524875efab40da8f5c2665ae).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1759#issuecomment-56115399
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20558/consoleFull) for   PR 1759 at commit [`677fa3d`](https://github.com/apache/spark/commit/677fa3d1cfffa7129cd5c8ab01b8918234cf4ce5).
     * This patch merges cleanly.

