Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2014/07/09 21:32:55 UTC

[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1346

    [WIP][SPARK-2179][SQL] Public API for DataTypes and Schema

    This PR contains the following changes:
    * Expose `DataType`s in the sql package (internal details stay private to sql).
    * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD`, given a schema (represented by a `StructType`) and a function that constructs a `Row` from each record.
    * Add a `simpleString` method to every `DataType`. Also, the schema represented by a `StructType` can be visualized with `printSchema`.
    
    An example of using `createSchemaRDD` is shown below.
    ```scala
    import org.apache.spark.sql._
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    val schema =
      StructType(
        StructField("name", StringType, false) ::
        StructField("age", IntegerType, true) :: Nil)
    
    def createRow(record: String): Row = {
      val items = record.split(",")
      Row(items(0), items(1).trim.toInt)
    }
    
    val people = sc.textFile("examples/src/main/resources/people.txt")
    val peopleSchemaRDD = sqlContext.createSchemaRDD(people, schema, createRow)
    peopleSchemaRDD.printSchema
    // root
    // |-- name: string (nullable = false)
    // |-- age: integer (nullable = true)
    
    peopleSchemaRDD.registerAsTable("people")
    sqlContext.sql("select name from people").collect.foreach(println)
    ```
    
    I am currently working on the following:
    * Move the general-purpose functions introduced by the JSON support into the type system.
    * Add pre-defined `constructRow` functions to help users easily use `createSchemaRDD` (e.g. applying a schema to a JSON dataset).
    * Add a method to let users optionally validate the schema of a `SchemaRDD` by checking every row.
    * Add needed tests.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2179

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark dataTypeAndSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1346
    
----
commit 16be3e56e83a406af86b9e7f18059ec7a2595a9e
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-07-09T18:51:24Z

    This commit contains three changes:
    * Expose `DataType`s in the sql package (internal details are private to sql).
    * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`s.
    * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483514
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
    @@ -0,0 +1,401 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import org.apache.spark.annotation.DeveloperApi
    +
    +/**
    + * Allows the execution of relational queries, including those expressed in SQL using Spark.
    + *
    + *  @groupname dataType Data types
    + *  @groupdesc Spark SQL data types.
    + *  @groupprio dataType -3
    + *  @groupname field Field
    + *  @groupprio field -2
    + *  @groupname row Row
    + *  @groupprio row -1
    + */
    +package object sql {
    +
    +  protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * Represents one row of output from a relational operator.
    +   * @group row
    +   */
    +  @DeveloperApi
    +  type Row = catalyst.expressions.Row
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * A [[Row]] object can be constructed by providing field values. Example:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * // Create a Row from values.
    +   * Row(value1, value2, value3, ...)
    +   * // Create a Row from a Seq of values.
    +   * Row.fromSeq(Seq(value1, value2, ...))
    +   * }}}
    +   *
    +   * A value of a row can be accessed through both generic access by ordinal,
    +   * which will incur boxing overhead for primitives, as well as native primitive access.
    +   * An example of generic access by ordinal:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * val row = Row(1, true, "a string", null)
    +   * // row: Row = [1,true,a string,null]
    +   * val firstValue = row(0)
    +   * // firstValue: Any = 1
    +   * val fourthValue = row(3)
    +   * // fourthValue: Any = null
    +   * }}}
    +   *
    +   * For native primitive access, it is invalid to use the native primitive interface to retrieve
    +   * a value that is null, instead a user must check `isNullAt` before attempting to retrieve a
    +   * value that might be null.
    +   * An example of native primitive access:
    +   * {{{
    +   * // using the row from the previous example.
    +   * val firstValue = row.getInt(0)
    +   * // firstValue: Int = 1
    +   * val isNull = row.isNullAt(3)
    +   * // isNull: Boolean = true
    +   * }}}
    +   *
    +   * Interfaces related to native primitive access are:
    +   *
    +   * `isNullAt(i: Int): Boolean`
    +   *
    +   * `getInt(i: Int): Int`
    +   *
    +   * `getLong(i: Int): Long`
    +   *
    +   * `getDouble(i: Int): Double`
    +   *
    +   * `getFloat(i: Int): Float`
    +   *
    +   * `getBoolean(i: Int): Boolean`
    +   *
    +   * `getShort(i: Int): Short`
    +   *
    +   * `getByte(i: Int): Byte`
    +   *
    +   * `getString(i: Int): String`
    +   *
    +   * Fields in a [[Row]] object can be extracted in a pattern match. Example:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * val pairs = sql("SELECT key, value FROM src").rdd.map {
    +   *   case Row(key: Int, value: String) =>
    +   *     key -> value
    +   * }
    +   * }}}
    +   *
    +   * @group row
    +   */
    +  @DeveloperApi
    +  val Row = catalyst.expressions.Row
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The base type of all Spark SQL data types.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  type DataType = catalyst.types.DataType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `String` values
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val StringType = catalyst.types.StringType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Array[Byte]` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val BinaryType = catalyst.types.BinaryType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Boolean` values.
    +   *
    +   *@group dataType
    +   */
    +  @DeveloperApi
    +  val BooleanType = catalyst.types.BooleanType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `java.sql.Timestamp` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val TimestampType = catalyst.types.TimestampType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `scala.math.BigDecimal` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val DecimalType = catalyst.types.DecimalType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Double` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val DoubleType = catalyst.types.DoubleType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Float` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val FloatType = catalyst.types.FloatType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Byte` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val ByteType = catalyst.types.ByteType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Int` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val IntegerType = catalyst.types.IntegerType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Long` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val LongType = catalyst.types.LongType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Short` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val ShortType = catalyst.types.ShortType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Seq`s.
    +   * An [[ArrayType]] object comprises two fields, `elementType: [[DataType]]` and
    +   * `containsNull: Boolean`. The field of `elementType` is used to specify the type of
    +   * array elements. The field of `containsNull` is used to specify if the array can have
    +   * any `null` value.
    --- End diff --
    
    Nit: "values"
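
    For reference, a minimal sketch of how the `ArrayType` described in the quoted doc is constructed, based on the Scala definitions quoted later in this thread (illustrative only; it assumes the sql package object also aliases `ArrayType`):
    ```scala
    import org.apache.spark.sql._

    // The one-argument apply defaults containsNull to false.
    val nonNullStrings = ArrayType(StringType)
    // Elements of this array may be null.
    val nullableInts = ArrayType(IntegerType, containsNull = true)
    ```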



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15499882
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    + */
    +public abstract class DataType {
    +
    +  /**
    +   * Gets the StringType object.
    +   */
    +  public static final StringType StringType = new StringType();
    +
    +  /**
    +   * Gets the BinaryType object.
    +   */
    +  public static final BinaryType BinaryType = new BinaryType();
    +
    +  /**
    +   * Gets the BooleanType object.
    +   */
    +  public static final BooleanType BooleanType = new BooleanType();
    +
    +  /**
    +   * Gets the TimestampType object.
    +   */
    +  public static final TimestampType TimestampType = new TimestampType();
    +
    +  /**
    +   * Gets the DecimalType object.
    +   */
    +  public static final DecimalType DecimalType = new DecimalType();
    +
    +  /**
    +   * Gets the DoubleType object.
    +   */
    +  public static final DoubleType DoubleType = new DoubleType();
    +
    +  /**
    +   * Gets the FloatType object.
    +   */
    +  public static final FloatType FloatType = new FloatType();
    +
    +  /**
    +   * Gets the ByteType object.
    +   */
    +  public static final ByteType ByteType = new ByteType();
    +
    +  /**
    +   * Gets the IntegerType object.
    +   */
    +  public static final IntegerType IntegerType = new IntegerType();
    +
    +  /**
    +   * Gets the LongType object.
    +   */
    +  public static final LongType LongType = new LongType();
    +
    +  /**
    +   * Gets the ShortType object.
    +   */
    +  public static final ShortType ShortType = new ShortType();
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
    +   * whether the array contains null values ({@code containsNull}).
    +   * @param elementType
    +   * @param containsNull
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50531165
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48560066
  
    Hi, I'm wondering if `MapType` will have something like `ArrayType`'s `containsNull`.
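
    (Later revisions in this thread mention a `valueContainsNull` field for `MapType` that defaults to `true`. Assuming it ends up mirroring `ArrayType`'s `containsNull`, a hypothetical sketch of what that could look like:)
    ```scala
    import org.apache.spark.sql._

    // Hypothetical: a map from string keys to integer values whose values may be null,
    // assuming a MapType(keyType, valueType, valueContainsNull) constructor is added.
    val scores = MapType(StringType, IntegerType, true)
    ```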



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50426883
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50531062
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15510995
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/WrapDynamic.scala ---
    @@ -21,7 +21,9 @@ import scala.language.dynamics
     
     import org.apache.spark.sql.catalyst.types.DataType
     
    -case object DynamicType extends DataType
    +case object DynamicType extends DataType {
    --- End diff --
    
    Do you mind adding scaladoc to explain what DynamicType is used for? (While you are at it, also add scaladoc for WrapDynamic and DynamicRow)



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15492345
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
       }
     
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
    --- End diff --
    
    Originally, I put it in `JavaSQLContext`. But I found I need access to `asJavaDataType` in `JavaSchemaRDD`, which only has a `SQLContext` instead of a `JavaSQLContext`. I guess we want to refactor `JavaSchemaRDD` to use `JavaSQLContext` instead of `SQLContext`?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48631313
  
    Yeah, I will make sure new APIs are usable in Java and Python.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904153
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -197,47 +213,145 @@ object FractionalType {
       }
     }
     abstract class FractionalType extends NumericType {
    -  val fractional: Fractional[JvmType]
    +  private[sql] val fractional: Fractional[JvmType]
     }
     
     case object DecimalType extends FractionalType {
    -  type JvmType = BigDecimal
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[BigDecimal]]
    -  val fractional = implicitly[Fractional[BigDecimal]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = BigDecimal
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
    +  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "decimal"
     }
     
     case object DoubleType extends FractionalType {
    -  type JvmType = Double
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Double]]
    -  val fractional = implicitly[Fractional[Double]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Double
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Double]]
    +  private[sql] val fractional = implicitly[Fractional[Double]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "double"
     }
     
     case object FloatType extends FractionalType {
    -  type JvmType = Float
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Float]]
    -  val fractional = implicitly[Fractional[Float]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Float
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Float]]
    +  private[sql] val fractional = implicitly[Fractional[Float]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "float"
     }
     
    -case class ArrayType(elementType: DataType) extends DataType
    +object ArrayType {
    +  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
    +}
     
    -case class StructField(name: String, dataType: DataType, nullable: Boolean)
    +case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(
    +      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
    +    elementType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +
    +  def simpleString: String = "array"
    +}
    +
    +case class StructField(name: String, dataType: DataType, nullable: Boolean) {
    +
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(s"${prefix}-- ${name}: ${dataType.simpleString} (nullable = ${nullable})\n")
    +    dataType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +}
     
     object StructType {
    -  def fromAttributes(attributes: Seq[Attribute]): StructType = {
    +  def fromAttributes(attributes: Seq[Attribute]): StructType =
         StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable)))
    -  }
     
    -  // def apply(fields: Seq[StructField]) = new StructType(fields.toIndexedSeq)
    +  private def validateFields(fields: Seq[StructField]): Boolean =
    +    fields.map(field => field.name).distinct.size == fields.size
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag](fields: (A, B)*): StructType =
    --- End diff --
    
    Do we need type parameterization and class tags here?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14906180
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -197,47 +213,145 @@ object FractionalType {
       }
     }
     abstract class FractionalType extends NumericType {
    -  val fractional: Fractional[JvmType]
    +  private[sql] val fractional: Fractional[JvmType]
     }
     
     case object DecimalType extends FractionalType {
    -  type JvmType = BigDecimal
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[BigDecimal]]
    -  val fractional = implicitly[Fractional[BigDecimal]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = BigDecimal
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
    +  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "decimal"
     }
     
     case object DoubleType extends FractionalType {
    -  type JvmType = Double
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Double]]
    -  val fractional = implicitly[Fractional[Double]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Double
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Double]]
    +  private[sql] val fractional = implicitly[Fractional[Double]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "double"
     }
     
     case object FloatType extends FractionalType {
    -  type JvmType = Float
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Float]]
    -  val fractional = implicitly[Fractional[Float]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Float
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Float]]
    +  private[sql] val fractional = implicitly[Fractional[Float]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "float"
     }
     
    -case class ArrayType(elementType: DataType) extends DataType
    +object ArrayType {
    +  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
    +}
     
    -case class StructField(name: String, dataType: DataType, nullable: Boolean)
    +case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(
    +      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
    +    elementType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +
    +  def simpleString: String = "array"
    +}
    +
    +case class StructField(name: String, dataType: DataType, nullable: Boolean) {
    +
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(s"${prefix}-- ${name}: ${dataType.simpleString} (nullable = ${nullable})\n")
    +    dataType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +}
     
     object StructType {
    -  def fromAttributes(attributes: Seq[Attribute]): StructType = {
    +  def fromAttributes(attributes: Seq[Attribute]): StructType =
         StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable)))
    -  }
     
    -  // def apply(fields: Seq[StructField]) = new StructType(fields.toIndexedSeq)
    +  private def validateFields(fields: Seq[StructField]): Boolean =
    +    fields.map(field => field.name).distinct.size == fields.size
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag](fields: (A, B)*): StructType =
    --- End diff --
    
    The original constructor of `StructType` takes a `Seq[StructField]` as its input parameter. If we add something like
    ```scala
    def apply(fields: (String, DataType)*): StructType =
      StructType(fields.map(field => StructField(field._1, field._2, true)))
    ```
    the Scala compiler will complain that this overload and the original constructor have the same parameter type after erasure (`fields: Seq`).
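
    (As a self-contained illustration of why the class tags help, not Spark code: the implicit `ClassTag` arguments give the tuple-based overload a different signature after erasure, so it can live next to the `Seq`-based `apply`.)
    ```scala
    import scala.reflect.ClassTag

    case class Field(name: String, dataType: String, nullable: Boolean)
    case class Schema(fields: Seq[Field])  // synthesizes apply(fields: Seq[Field])

    object Schema {
      // Erases to apply(Seq, ClassTag, ClassTag), which no longer clashes with
      // the synthesized apply(Seq) above.
      def apply[A <: String : ClassTag, B <: String : ClassTag](fields: (A, B)*): Schema =
        Schema(fields.map { case (name, dt) => Field(name, dt, true) })
    }

    // Usage: Schema("name" -> "string", "age" -> "int")
    ```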



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48781836
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16572/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483028
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
       }
     
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
    --- End diff --
    
    Should this be here or in the JavaSQLContext?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48792467
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16578/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50581580
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511405
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -259,8 +268,12 @@ private[sql] object JsonRDD extends Logging {
           // the ObjectMapper will take the last value associated with this duplicate key.
           // For example: for {"key": 1, "key":2}, we will get "key"->2.
           val mapper = new ObjectMapper()
    -      iter.map(record => mapper.readValue(record, classOf[java.util.Map[String, Any]]))
    -      }).map(scalafy).map(_.asInstanceOf[Map[String, Any]])
    +      iter.map {
    +        record =>
    --- End diff --
    
    Move `record` to the previous line and indent the whole thing one level less.
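
    (Roughly the shape being asked for, sketched as a small self-contained function; the jackson-databind `ObjectMapper` import is assumed from the quoted file:)
    ```scala
    import com.fasterxml.jackson.databind.ObjectMapper

    def parseRecords(iter: Iterator[String]): Iterator[java.util.Map[String, Any]] = {
      val mapper = new ObjectMapper()
      // record stays on the opening line of the closure, and the body is
      // indented one level less than in the quoted diff.
      iter.map { record =>
        mapper.readValue(record, classOf[java.util.Map[String, Any]])
      }
    }
    ```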



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481390
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType"""
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values. Because a float value
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
    --- End diff --
    
    Why?  What if they know the values are limited to the float range and want to use less memory?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50111861
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17160/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511457
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
    @@ -140,10 +142,12 @@ private[parquet] object ParquetTypesConverter extends Logging {
                 assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
                 val valueType = toDataType(keyValueGroup.getFields.apply(1))
                 assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
    -            new MapType(keyType, valueType)
    +            // TODO: set valueContainsNull explicitly instead of assuming valueContainsNull is true
    +            // at here.
    +            MapType(keyType, valueType)
               } else if (correspondsToArray(groupType)) { // ArrayType
                 val elementType = toDataType(groupType.getFields.apply(0))
    -            new ArrayType(elementType)
    +            ArrayType(elementType, false)
    --- End diff --
    
    here too



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511210
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    + *
    + * To get/create specific data type, users should use singleton objects and factory methods
    + * provided by this class.
    + */
    +public abstract class DataType {
    +
    +  /**
    +   * Gets the StringType object.
    +   */
    +  public static final StringType StringType = new StringType();
    +
    +  /**
    +   * Gets the BinaryType object.
    +   */
    +  public static final BinaryType BinaryType = new BinaryType();
    +
    +  /**
    +   * Gets the BooleanType object.
    +   */
    +  public static final BooleanType BooleanType = new BooleanType();
    +
    +  /**
    +   * Gets the TimestampType object.
    +   */
    +  public static final TimestampType TimestampType = new TimestampType();
    +
    +  /**
    +   * Gets the DecimalType object.
    +   */
    +  public static final DecimalType DecimalType = new DecimalType();
    +
    +  /**
    +   * Gets the DoubleType object.
    +   */
    +  public static final DoubleType DoubleType = new DoubleType();
    +
    +  /**
    +   * Gets the FloatType object.
    +   */
    +  public static final FloatType FloatType = new FloatType();
    +
    +  /**
    +   * Gets the ByteType object.
    +   */
    +  public static final ByteType ByteType = new ByteType();
    +
    +  /**
    +   * Gets the IntegerType object.
    +   */
    +  public static final IntegerType IntegerType = new IntegerType();
    +
    +  /**
    +   * Gets the LongType object.
    +   */
    +  public static final LongType LongType = new LongType();
    +
    +  /**
    +   * Gets the ShortType object.
    +   */
    +  public static final ShortType ShortType = new ShortType();
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}).
    +   * The field of {@code containsNull} is set to {@code false}.
    +   *
    +   * @param elementType
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType) {
    +    if (elementType == null) {
    +      throw new IllegalArgumentException("elementType should not be null.");
    +    }
    +
    +    return new ArrayType(elementType, false);
    +  }
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
    +   * whether the array contains null values ({@code containsNull}).
    +   * @param elementType
    +   * @param containsNull
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
    +    if (elementType == null) {
    +      throw new IllegalArgumentException("elementType should not be null.");
    +    }
    +
    +    return new ArrayType(elementType, containsNull);
    +  }
    +
    +  /**
    +   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
    +   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
    +   *
    +   * @param keyType
    +   * @param valueType
    +   * @return
    --- End diff --
    
    Actually, the params too: if you don't explain any of them, just remove them.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481675
  
    --- Diff: python/pyspark/sql.py ---
    @@ -107,6 +512,25 @@ def inferSchema(self, rdd):
             srdd = self._ssql_ctx.inferSchema(jrdd.rdd())
             return SchemaRDD(srdd, self)
     
    +    def applySchema(self, rdd, schema):
    +        """Applies the given schema to the given RDD of L{dict}s.
    --- End diff --
    
    Are we still allowing dicts? I thought there was at least going to be a warning. Or is this going to change with @davies' PR?



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50590014
  
    Thank you @yhuai for the explanation.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15551776
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    In Java/Scala, when a user loads data from a CSV file, they need to do this kind of type conversion; it would be better if we could do it for them automatically.
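
    (The kind of hand-written conversion being referred to, along the lines of the `createRow` example in the PR description; the column layout here is hypothetical:)
    ```scala
    import org.apache.spark.sql._

    // Raw CSV fields are strings, so every numeric column needs an explicit conversion.
    def createRow(line: String): Row = {
      val cols = line.split(",")
      Row(cols(0), cols(1).trim.toInt, cols(2).trim.toDouble)
    }
    ```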



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50452464
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48792410
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16578/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904015
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -197,47 +213,145 @@ object FractionalType {
       }
     }
     abstract class FractionalType extends NumericType {
    -  val fractional: Fractional[JvmType]
    +  private[sql] val fractional: Fractional[JvmType]
     }
     
     case object DecimalType extends FractionalType {
    -  type JvmType = BigDecimal
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[BigDecimal]]
    -  val fractional = implicitly[Fractional[BigDecimal]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = BigDecimal
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
    +  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "decimal"
     }
     
     case object DoubleType extends FractionalType {
    -  type JvmType = Double
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Double]]
    -  val fractional = implicitly[Fractional[Double]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Double
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Double]]
    +  private[sql] val fractional = implicitly[Fractional[Double]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "double"
     }
     
     case object FloatType extends FractionalType {
    -  type JvmType = Float
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Float]]
    -  val fractional = implicitly[Fractional[Float]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Float
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Float]]
    +  private[sql] val fractional = implicitly[Fractional[Float]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "float"
     }
     
    -case class ArrayType(elementType: DataType) extends DataType
    +object ArrayType {
    +  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
    +}
     
    -case class StructField(name: String, dataType: DataType, nullable: Boolean)
    +case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(
    +      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
    +    elementType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +
    +  def simpleString: String = "array"
    +}
    +
    +case class StructField(name: String, dataType: DataType, nullable: Boolean) {
    +
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(s"${prefix}-- ${name}: ${dataType.simpleString} (nullable = ${nullable})\n")
    +    dataType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +}
     
     object StructType {
    -  def fromAttributes(attributes: Seq[Attribute]): StructType = {
    +  def fromAttributes(attributes: Seq[Attribute]): StructType =
         StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable)))
    -  }
     
    -  // def apply(fields: Seq[StructField]) = new StructType(fields.toIndexedSeq)
    +  private def validateFields(fields: Seq[StructField]): Boolean =
    +    fields.map(field => field.name).distinct.size == fields.size
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag](fields: (A, B)*): StructType =
    +    StructType(fields.map(field => StructField(field._1, field._2, true)))
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag, C <: Boolean: ClassTag](
    +      fields: (A, B, C)*): StructType =
    +    StructType(fields.map(field => StructField(field._1, field._2, field._3)))
     }
     
     case class StructType(fields: Seq[StructField]) extends DataType {
    +  require(StructType.validateFields(fields), "Found fields with the same name.")
    +
    +  def apply(name: String): StructField = {
    +    fields.find(f => f.name == name).orNull
    +  }
    +
    +  def apply(names: String*): StructType = {
    +    val nameSet = names.toSet
    +    StructType(fields.filter(f => nameSet.contains(f.name)))
    +  }
    +
       def toAttributes = fields.map(f => AttributeReference(f.name, f.dataType, f.nullable)())
    +
    +  def schemaString: String = {
    +    val builder = new StringBuilder
    +    builder.append("root\n")
    +    val prefix = " |"
    +    fields.foreach(field => field.buildFormattedString(prefix, builder))
    +
    +    builder.toString()
    +  }
    +
    +  def printSchema(): Unit = println(schemaString)
    +
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    fields.foreach(field => field.buildFormattedString(prefix, builder))
    +  }
    +
    +  def simpleString: String = "struct"
     }
     
    -case class MapType(keyType: DataType, valueType: DataType) extends DataType
    +case class MapType(keyType: DataType, valueType: DataType) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(s"${prefix}-- key: ${keyType.simpleString}\n")
    +    keyType match {
    --- End diff --
    
    This matching code is duplicated like 4 times AFAICT.  Perhaps it could just be a protected function in DataType.
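    
    A self-contained sketch of that refactor, using toy stand-ins rather than the real catalyst classes: the match over nested complex types is written once as a protected helper on the base class, and each complex type's formatter reuses it.
    
    ```scala
    object FormatSketch {
      abstract class DType {
        def simpleString: String
        def buildFormattedString(prefix: String, builder: StringBuilder): Unit = ()
    
        // The previously duplicated match, now written once.
        protected def formatNested(dt: DType, prefix: String, builder: StringBuilder): Unit =
          dt match {
            case nested @ (_: ArrType | _: MType) =>
              nested.buildFormattedString(s"$prefix    |", builder)
            case _ => ()
          }
      }
    
      case object IntType extends DType { def simpleString = "int" }
    
      case class ArrType(elementType: DType) extends DType {
        def simpleString = "array"
        override def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
          builder.append(s"$prefix-- element: ${elementType.simpleString}\n")
          formatNested(elementType, prefix, builder)
        }
      }
    
      case class MType(keyType: DType, valueType: DType) extends DType {
        def simpleString = "map"
        override def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
          builder.append(s"$prefix-- key: ${keyType.simpleString}\n")
          formatNested(keyType, prefix, builder)
          builder.append(s"$prefix-- value: ${valueType.simpleString}\n")
          formatNested(valueType, prefix, builder)
        }
      }
    
      def main(args: Array[String]): Unit = {
        val builder = new StringBuilder
        MType(IntType, ArrType(IntType)).buildFormattedString(" |", builder)
        print(builder)
      }
    }
    ```
    
    Note that in the PR, StructField duplicates the same match but is not itself a DataType, so the real change would also need the helper to be reachable from it (e.g. via a companion object).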



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481312
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType"""
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values. Because a float value
    --- End diff --
    
    This comment isn't finished.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48680525
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16528/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50576339
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15492409
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
       }
     
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
    --- End diff --
    
    Oh, I see. These are all static functions, right? Maybe we could put them all in a Python support object.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50432240
  
    I am reviewing it. Will have an update soon.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511199
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    + *
    + * To get/create specific data type, users should use singleton objects and factory methods
    + * provided by this class.
    + */
    +public abstract class DataType {
    +
    +  /**
    +   * Gets the StringType object.
    +   */
    +  public static final StringType StringType = new StringType();
    +
    +  /**
    +   * Gets the BinaryType object.
    +   */
    +  public static final BinaryType BinaryType = new BinaryType();
    +
    +  /**
    +   * Gets the BooleanType object.
    +   */
    +  public static final BooleanType BooleanType = new BooleanType();
    +
    +  /**
    +   * Gets the TimestampType object.
    +   */
    +  public static final TimestampType TimestampType = new TimestampType();
    +
    +  /**
    +   * Gets the DecimalType object.
    +   */
    +  public static final DecimalType DecimalType = new DecimalType();
    +
    +  /**
    +   * Gets the DoubleType object.
    +   */
    +  public static final DoubleType DoubleType = new DoubleType();
    +
    +  /**
    +   * Gets the FloatType object.
    +   */
    +  public static final FloatType FloatType = new FloatType();
    +
    +  /**
    +   * Gets the ByteType object.
    +   */
    +  public static final ByteType ByteType = new ByteType();
    +
    +  /**
    +   * Gets the IntegerType object.
    +   */
    +  public static final IntegerType IntegerType = new IntegerType();
    +
    +  /**
    +   * Gets the LongType object.
    +   */
    +  public static final LongType LongType = new LongType();
    +
    +  /**
    +   * Gets the ShortType object.
    +   */
    +  public static final ShortType ShortType = new ShortType();
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}).
    +   * The field of {@code containsNull} is set to {@code false}.
    +   *
    +   * @param elementType
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType) {
    +    if (elementType == null) {
    +      throw new IllegalArgumentException("elementType should not be null.");
    +    }
    +
    +    return new ArrayType(elementType, false);
    +  }
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
    +   * whether the array contains null values ({@code containsNull}).
    +   * @param elementType
    +   * @param containsNull
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
    +    if (elementType == null) {
    +      throw new IllegalArgumentException("elementType should not be null.");
    +    }
    +
    +    return new ArrayType(elementType, containsNull);
    +  }
    +
    +  /**
    +   * Creates a MapType by specifying the data type of keys ({@code keyType}) and values
    +   * ({@code keyType}). The field of {@code valueContainsNull} is set to {@code true}.
    +   *
    +   * @param keyType
    +   * @param valueType
    +   * @return
    --- End diff --
    
    Remove the @return tag if you are not going to say anything about it. Also remove it for the other functions in this PR.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48561811
  
    Hi @ueshin, @marmbrus and I discussed it. We think it is not semantically clear what a null means when it appears in the key or value field (considering that a null is used to indicate a missing data value). So, we decided that `key` and `value` in a `MapType` should not contain any null values, and we will not introduce `containsNull` to `MapType`. Does that make sense?



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50576308
  
    @chenghao-intel `containsNull` and `valueContainsNull` can be used for further optimization. For example, let's say we have an `ArrayType` column and the element type is `IntegerType`. If the elements of those arrays do not have `null` values, we can use a primitive array internally. Since we will expose data types to users, we need to introduce these two booleans with this PR; it would be hard to add them once users start to use these APIs.
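    
    A tiny sketch of the optimization being described here (illustrative only, not code in this PR): once `containsNull = false` is known for an integer element type, the elements can be stored unboxed.
    
    ```scala
    // If no element can be null, the column can hold a primitive Array[Int];
    // otherwise it must keep boxed java.lang.Integer values so that null stays representable.
    def storeIntElements(values: Seq[java.lang.Integer], containsNull: Boolean): AnyRef =
      if (!containsNull) values.map(_.intValue).toArray  // Array[Int], no per-element boxing
      else values.toArray                                // Array[java.lang.Integer]
    ```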



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50420623
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17320/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48800669
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16582/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50291054
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48771573
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16572/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481537
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType"""
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values. Because a float value
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For PySpark, please use L{IntegerType} instead of using L{ByteType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For PySpark, please use L{IntegerType} instead of using L{ShortType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "ShortType"
    +
    +class ArrayType(object):
    +    """Spark SQL ArrayType
    +
    +    The data type representing list values.
    +
    +    """
    +    def __init__(self, elementType, containsNull):
    +        """Creates an ArrayType
    +
    +        :param elementType: the data type of elements.
    +        :param containsNull: indicates whether the list contains null values.
    +        :return:
    +
    +        >>> ArrayType(StringType, True) == ArrayType(StringType, False)
    +        False
    +        >>> ArrayType(StringType, True) == ArrayType(StringType, True)
    +        True
    +        """
    +        self.elementType = elementType
    +        self.containsNull = containsNull
    +
    +    def _get_scala_type_string(self):
    +        return "ArrayType(" + self.elementType._get_scala_type_string() + "," + \
    +               str(self.containsNull).lower() + ")"
    +
    +    def __eq__(self, other):
    +        return (isinstance(other, self.__class__) and \
    +            self.elementType == other.elementType and \
    +            self.containsNull == other.containsNull)
    +
    +    def __ne__(self, other):
    +        return not self.__eq__(other)
    +
    +
    +class MapType(object):
    +    """Spark SQL MapType
    +
    +    The data type representing dict values.
    +
    +    """
    +    def __init__(self, keyType, valueType):
    --- End diff --
    
    I thought we decided in the meeting that we need to have a null bit for the key and value, since Hive does.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482239
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java ---
    @@ -0,0 +1,22 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +public class BooleanType extends DataType {
    --- End diff --
    
    Also, perhaps the Javadoc should make it clear that users don't instantiate these themselves, but instead get the singletons from the DataType class.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511453
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
    @@ -116,7 +116,7 @@ private[parquet] object ParquetTypesConverter extends Logging {
             case ParquetOriginalType.LIST => { // TODO: check enums!
               assert(groupType.getFieldCount == 1)
               val field = groupType.getFields.apply(0)
    -          new ArrayType(toDataType(field))
    +          ArrayType(toDataType(field), false)
    --- End diff --
    
    for boolean argument, make them named argument. e.g. 
    ```scala
    ArrayType(toDataType(field), nullable = false)  // maybe it was containsNull
    ```



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483058
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
       }
     
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
    --- End diff --
    
    Same for the functions below.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50292497
  
    @yhuai awesome! I will update my diff to use this API.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50423851
  
    To add to this: for my own purposes, I can certainly hack something together based on this branch in a custom Spark build, but I just want to throw this thought out there as I think it does have some generality (large numbers of columns, avoiding writing `.map(p => Row(p(0), p(1), ..., p(LARGE_NUM)))`).
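    
    A side note on the wide-row case: `Row.fromSeq`, which the Row documentation in this PR mentions, at least avoids spelling out every ordinal, though it does no type coercion. A sketch:
    
    ```scala
    import org.apache.spark.sql._
    
    // Build a Row from an already-ordered sequence of column values instead of
    // writing Row(p(0), p(1), ..., p(LARGE_NUM)) by hand. The values must already
    // have the JVM types the schema expects.
    def toRow(columns: Seq[Any]): Row = Row.fromSeq(columns)
    ```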



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50533105
  
    QA results for PR 1346:
    - This patch FAILED unit tests.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481834
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
    @@ -17,11 +17,12 @@
     
     package org.apache.spark.sql.catalyst.expressions
     
    +import com.typesafe.scalalogging.slf4j.Logging
    --- End diff --
    
    We should use either Spark Logging or Spark SQL logging. (Ideally we will be removing catalyst's dependence on Spark solely for the logging code, but I'm okay with either ATM.) We shouldn't hard-code this library here, though.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14735318
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +89,16 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
     
       /**
    +   * Creates a SchemaRDD from an RDD by applying a schema and providing a function to construct
    +   * a Row from a RDD record.
    +   *
    +   * @group userf
    +   */
    +  def createSchemaRDD[A](rdd: RDD[A], schema: StructType, constructRow: A => Row) = {
    --- End diff --
    
    Naming nit: functional languages usually use "make"; moreover, SparkContext already has a public `makeRDD`.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15555549
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    Yes, let's think about it later.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482163
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/BooleanType.java ---
    @@ -0,0 +1,22 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +public class BooleanType extends DataType {
    --- End diff --
    
    Missing Javadoc.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483619
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
    @@ -0,0 +1,401 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import org.apache.spark.annotation.DeveloperApi
    +
    +/**
    + * Allows the execution of relational queries, including those expressed in SQL using Spark.
    + *
    + *  @groupname dataType Data types
    + *  @groupdesc Spark SQL data types.
    + *  @groupprio dataType -3
    + *  @groupname field Field
    + *  @groupprio field -2
    + *  @groupname row Row
    + *  @groupprio row -1
    + */
    +package object sql {
    +
    +  protected[sql] type Logging = com.typesafe.scalalogging.slf4j.Logging
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * Represents one row of output from a relational operator.
    +   * @group row
    +   */
    +  @DeveloperApi
    +  type Row = catalyst.expressions.Row
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * A [[Row]] object can be constructed by providing field values. Example:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * // Create a Row from values.
    +   * Row(value1, value2, value3, ...)
    +   * // Create a Row from a Seq of values.
    +   * Row.fromSeq(Seq(value1, value2, ...))
    +   * }}}
    +   *
    +   * A value of a row can be accessed through both generic access by ordinal,
    +   * which will incur boxing overhead for primitives, as well as native primitive access.
    +   * An example of generic access by ordinal:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * val row = Row(1, true, "a string", null)
    +   * // row: Row = [1,true,a string,null]
    +   * val firstValue = row(0)
    +   * // firstValue: Any = 1
    +   * val fourthValue = row(3)
    +   * // fourthValue: Any = null
    +   * }}}
    +   *
    +   * For native primitive access, it is invalid to use the native primitive interface to retrieve
    +   * a value that is null, instead a user must check `isNullAt` before attempting to retrieve a
    +   * value that might be null.
    +   * An example of native primitive access:
    +   * {{{
    +   * // using the row from the previous example.
    +   * val firstValue = row.getInt(0)
    +   * // firstValue: Int = 1
    +   * val isNull = row.isNullAt(3)
    +   * // isNull: Boolean = true
    +   * }}}
    +   *
    +   * Interfaces related to native primitive access are:
    +   *
    +   * `isNullAt(i: Int): Boolean`
    +   *
    +   * `getInt(i: Int): Int`
    +   *
    +   * `getLong(i: Int): Long`
    +   *
    +   * `getDouble(i: Int): Double`
    +   *
    +   * `getFloat(i: Int): Float`
    +   *
    +   * `getBoolean(i: Int): Boolean`
    +   *
    +   * `getShort(i: Int): Short`
    +   *
    +   * `getByte(i: Int): Byte`
    +   *
    +   * `getString(i: Int): String`
    +   *
    +   * Fields in a [[Row]] object can be extracted in a pattern match. Example:
    +   * {{{
    +   * import org.apache.spark.sql._
    +   *
    +   * val pairs = sql("SELECT key, value FROM src").rdd.map {
    +   *   case Row(key: Int, value: String) =>
    +   *     key -> value
    +   * }
    +   * }}}
    +   *
    +   * @group row
    +   */
    +  @DeveloperApi
    +  val Row = catalyst.expressions.Row
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The base type of all Spark SQL data types.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  type DataType = catalyst.types.DataType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `String` values
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val StringType = catalyst.types.StringType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Array[Byte]` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val BinaryType = catalyst.types.BinaryType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Boolean` values.
    +   *
    +   *@group dataType
    +   */
    +  @DeveloperApi
    +  val BooleanType = catalyst.types.BooleanType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `java.sql.Timestamp` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val TimestampType = catalyst.types.TimestampType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `scala.math.BigDecimal` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val DecimalType = catalyst.types.DecimalType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Double` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val DoubleType = catalyst.types.DoubleType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Float` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val FloatType = catalyst.types.FloatType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Byte` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val ByteType = catalyst.types.ByteType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Int` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val IntegerType = catalyst.types.IntegerType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Long` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val LongType = catalyst.types.LongType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Short` values.
    +   *
    +   * @group dataType
    +   */
    +  @DeveloperApi
    +  val ShortType = catalyst.types.ShortType
    +
    +  /**
    +   * :: DeveloperApi ::
    +   *
    +   * The data type representing `Seq`s.
    --- End diff --
    
    How about: "The DataType for collections of multiple values.  Internally these are represented as columns that contain a `scala.collection.Seq`."
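    
    A one-line illustration of that wording (sketch with a hypothetical column name): a column declared as `ArrayType(IntegerType)` carries a `scala.collection.Seq` value in each Row.
    
    ```scala
    import org.apache.spark.sql._
    
    val schema = StructType(StructField("scores", ArrayType(IntegerType), true) :: Nil)
    val row = Row(Seq(1, 2, 3))  // the ArrayType column's value is a Seq
    ```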



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904132
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/package.scala ---
    @@ -33,4 +33,48 @@ package object sql {
       type Row = catalyst.expressions.Row
     
       val Row = catalyst.expressions.Row
    +
    +  type DataType = catalyst.types.DataType
    +
    +  val DataType = catalyst.types.DataType
    +
    +  val NullType = catalyst.types.NullType
    --- End diff --
    
    I think we do not want to expose NullType. I will remove it in my next update.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48523589
  
    Merged build started. 



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48535194
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15550525
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    JsonRDD already does this kind of conversion. I am not sure we want to do these conversions in Java and Scala; in Scala and Java, users can actually use `Short`, `Byte`, and `Float` values directly.
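
    To illustrate that point, here is a minimal Scala sketch; `sc`, `sqlContext`, and the
    `applySchema` call are assumptions based on the API discussed in this thread, not code
    taken from the patch itself:
    
    ```scala
    import org.apache.spark.sql._
    
    val schema =
      StructType(
        StructField("id", ShortType, false) ::
        StructField("flag", ByteType, false) ::
        StructField("score", FloatType, false) :: Nil)
    
    // In Scala, the Row can carry the narrow JVM values the schema declares.
    val rowRDD = sc.parallelize(Seq(Row(1.toShort, 0.toByte, 0.5f)))
    val shortByteFloatRDD = sqlContext.applySchema(rowRDD, schema)
    ```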



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50451426
  
    @yhuai can you describe a little more about `containsNull` for `ArrayType` and `MapType`? In my understanding, `Map` and `Array` contain nulls in most cases at runtime, so why not just keep the previous implementation? Will something go wrong when producing the RDD schema if the constraint is not added?
    
    Sorry if I missed some discussion here.
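
    For reference, a minimal sketch of how `containsNull` is declared, assuming the
    `ArrayType(elementType, containsNull)` signature quoted elsewhere in this thread
    (the one-argument `ArrayType(elementType)` constructor defaults `containsNull` to false):
    
    ```scala
    import org.apache.spark.sql._
    
    // Elements of "tags" may be null; ArrayType(StringType) would instead
    // declare that no element is ever null.
    val tags = StructField("tags", ArrayType(StringType, true), true)
    val schema = StructType(tags :: Nil)
    ```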



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15548865
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    We could also convert the type to the correct type on the way in from python.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48694027
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16547/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15549991
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    Converting to the expected type will also reduce the memory cost when the data is cached.
    
    When calling applySchema(), we should check the data type against the schema and do this conversion automatically, in Java and Scala as well as in JsonRDD.
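
    As a rough sketch of that suggestion (an illustration of the proposal, not code in this
    PR), the narrowing could look like this on the Scala side, assuming the data types
    exposed by this patch:
    
    ```scala
    import org.apache.spark.sql._
    
    // Narrow an incoming value to the JVM type declared by the schema before
    // building the Row, so ShortType/ByteType/FloatType columns hold narrow values.
    def coerce(value: Any, dataType: DataType): Any = (value, dataType) match {
      case (i: Int, ShortType)    => i.toShort
      case (i: Int, ByteType)     => i.toByte
      case (d: Double, FloatType) => d.toFloat
      case (other, _)             => other
    }
    ```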



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904749
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -102,15 +128,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
        *
        * @group userf
        */
    -  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
    +  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0, None)
    +
    +  /**
    +   * Loads a JSON file (one object per line) and applies the given schema,
    +   * returning the result as a [[SchemaRDD]].
    +   *
    +   * @group userf
    +   */
    +  def jsonFile(path: String, schema: StructType): SchemaRDD = jsonFile(path, 1.0, Option(schema))
     
       /**
        * :: Experimental ::
        */
       @Experimental
    -  def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
    +  def jsonFile(path: String, samplingRatio: Double, schema: Option[StructType]): SchemaRDD = {
    --- End diff --
    
    We need to be careful here to avoid removing old API functions if we don't have to (i.e. I think we are dropping `jsonFile(path, sampling)`). Furthermore, just using default arguments is not enough to preserve binary compatibility.
    
    Also, when would you ever want to specify both the samplingRatio and a schema?
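
    One way to read that concern (a sketch of the API shape only, not the merged code) is to
    keep the existing overloads untouched and add the schema-based variant alongside them:
    
    ```scala
    import org.apache.spark.sql.{SchemaRDD, StructType}
    
    trait JsonLoading {
      def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
      def jsonFile(path: String, samplingRatio: Double): SchemaRDD  // existing overload, kept
      def jsonFile(path: String, schema: StructType): SchemaRDD     // new overload, no samplingRatio
    }
    ```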



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50290941
  
    @davies I have added the Python APIs. 



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482947
  
    --- Diff: python/pyspark/sql.py ---
    @@ -107,6 +512,25 @@ def inferSchema(self, rdd):
             srdd = self._ssql_ctx.inferSchema(jrdd.rdd())
             return SchemaRDD(srdd, self)
     
    +    def applySchema(self, rdd, schema):
    +        """Applies the given schema to the given RDD of L{dict}s.
    --- End diff --
    
    Right, @davies will make the change in his PR.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14777604
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -123,9 +123,12 @@ private[sql] trait SchemaRDDLike {
       def saveAsTable(tableName: String): Unit =
         sqlContext.executePlan(InsertIntoCreatedTable(None, tableName, logicalPlan)).toRdd
     
    +  /** Returns the schema. */
    +  def schema: StructType = queryExecution.analyzed.schema
    +
       /** Returns the output schema in the tree format. */
    -  def schemaString: String = queryExecution.analyzed.schemaString
    +  def formattedSchemaString: String = schema.formattedSchemaString
    --- End diff --
    
    We do not have to change it. I was thinking that users probably have not noticed this API yet and that `formattedSchemaString` may be a more meaningful name. I can change it back if you think keeping the current name is better.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48535200
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16474/



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50420304
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481080
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    --- End diff --
    
    We probably just want to say byte arrays here since we have a separate type for byte.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15499885
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java ---
    @@ -0,0 +1,22 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Allows users to get and create Spark SQL data types.
    + */
    +package org.apache.spark.sql.api.java.types;
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50424563
  
    @concretevitamin There is another way to create a row: `Row.fromSeq(values: Seq[Any])`. Or, you can expand the array by using `:_*`.
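
    In other words (a small sketch; the `values` sequence is hypothetical):
    
    ```scala
    import org.apache.spark.sql._
    
    val values: Seq[Any] = Seq("Alice", 30)
    
    val row1 = Row.fromSeq(values)  // build the Row directly from a Seq[Any]
    val row2 = Row(values: _*)      // or expand the collection into the varargs constructor
    ```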



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14906218
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -102,15 +128,23 @@ class SQLContext(@transient val sparkContext: SparkContext)
        *
        * @group userf
        */
    -  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
    +  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0, None)
    +
    +  /**
    +   * Loads a JSON file (one object per line) and applies the given schema,
    +   * returning the result as a [[SchemaRDD]].
    +   *
    +   * @group userf
    +   */
    +  def jsonFile(path: String, schema: StructType): SchemaRDD = jsonFile(path, 1.0, Option(schema))
     
       /**
        * :: Experimental ::
        */
       @Experimental
    -  def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
    +  def jsonFile(path: String, samplingRatio: Double, schema: Option[StructType]): SchemaRDD = {
    --- End diff --
    
    Oh, right, it does not make sense to have both a samplingRatio and a schema.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48570378
  
    @yhuai I haven't looked at the changes yet, but can you make sure the end API is usable in Java?



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15548770
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    If we have a ShortType column, the expression evaluator will try to cast the value to a `Short` (`asInstanceOf[Short]`). However, the cast will fail because the data is actually a `java.lang.Integer`. I will add more documentation.
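
    A minimal sketch of that failure mode, assuming a Python int arrives on the JVM side as a
    boxed `java.lang.Integer` via Py4J:
    
    ```scala
    val fromPython: Any = Integer.valueOf(1)  // what a Python int becomes on the JVM
    
    // What the evaluator effectively does for a ShortType column; uncommenting this
    // line throws java.lang.ClassCastException because the value is an Integer:
    // fromPython.asInstanceOf[Short]
    ```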



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by chutium <gi...@git.apache.org>.
Github user chutium commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15862768
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be runtime exception.
    +   * Example:
    +   * {{{
    +   *  import org.apache.spark.sql._
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val schema =
    +   *    StructType(
    +   *      StructField("name", StringType, false) ::
    +   *      StructField("age", IntegerType, true) :: Nil)
    +   *
    --- End diff --
    
    Good, I merged the change and used this API, `applySchema(rowRDD, appliedSchema)`, in #1612.





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15767384
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be runtime exception.
    +   * Example:
    +   * {{{
    +   *  import org.apache.spark.sql._
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val schema =
    +   *    StructType(
    +   *      StructField("name", StringType, false) ::
    +   *      StructField("age", IntegerType, true) :: Nil)
    +   *
    --- End diff --
    
    For the completeness of our data types, we need `StructType` (`Seq[StructField]` is not a data type). For example, if the type of a field is a struct, we need a way to describe that this field's type is a struct. Also, because a row is basically a struct value, it is natural to use `StructType` to represent a schema.
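
    A short sketch of the nesting that motivates this, using the types exposed by this PR:
    
    ```scala
    import org.apache.spark.sql._
    
    // "address" is a field whose type is itself a struct; this is only expressible
    // because StructType is a DataType like any other.
    val addressType =
      StructType(
        StructField("city", StringType, true) ::
        StructField("zip", StringType, true) :: Nil)
    
    val personSchema =
      StructType(
        StructField("name", StringType, false) ::
        StructField("address", addressType, true) :: Nil)
    ```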





[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48690410
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16547/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50443467
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14903645
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
    @@ -85,6 +85,26 @@ object ScalaReflection {
         case t if t <:< definitions.BooleanTpe => Schema(BooleanType, nullable = false)
       }
     
    +  def typeOfObject: PartialFunction[Any, DataType] = {
    +    // The data type can be determined without ambiguity.
    +    case obj: BooleanType.JvmType => BooleanType
    +    case obj: BinaryType.JvmType => BinaryType
    +    case obj: StringType.JvmType => StringType
    +    case obj: ByteType.JvmType => ByteType
    +    case obj: ShortType.JvmType => ShortType
    +    case obj: IntegerType.JvmType => IntegerType
    +    case obj: LongType.JvmType => LongType
    +    case obj: FloatType.JvmType => FloatType
    +    case obj: DoubleType.JvmType => DoubleType
    +    case obj: DecimalType.JvmType => DecimalType
    +    case obj: TimestampType.JvmType => TimestampType
    +    case null => NullType
    +    // For other cases, there is no obvious mapping from the type of the given object to a
    --- End diff --
    
    Perhaps this should go in the scaladoc for this partial function.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15510908
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    --- End diff --
    
    I think PEP8 requires two blank lines to separate top-level classes.
    
    It would be better to run the pep8 checker on the files changed by this PR, since most other files are now pep8 clean and we will add a pep8 checker to Jenkins.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50107533
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17160/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511259
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +90,45 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be runtime exception.
    +   *
    +   * @group userf
    --- End diff --
    
    It would be great to give an inline example. Just wrap it with
    ```scala
    {{{
      // example code here
    }}}
    ```



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511092
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -201,47 +231,139 @@ object FractionalType {
       }
     }
     abstract class FractionalType extends NumericType {
    -  val fractional: Fractional[JvmType]
    +  private[sql] val fractional: Fractional[JvmType]
     }
     
     case object DecimalType extends FractionalType {
    -  type JvmType = BigDecimal
    -  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    -  val numeric = implicitly[Numeric[BigDecimal]]
    -  val fractional = implicitly[Fractional[BigDecimal]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = BigDecimal
    +  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    +  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
    +  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "decimal"
     }
     
     case object DoubleType extends FractionalType {
    -  type JvmType = Double
    -  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    -  val numeric = implicitly[Numeric[Double]]
    -  val fractional = implicitly[Fractional[Double]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Double
    +  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    +  private[sql] val numeric = implicitly[Numeric[Double]]
    +  private[sql] val fractional = implicitly[Fractional[Double]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "double"
     }
     
     case object FloatType extends FractionalType {
    -  type JvmType = Float
    -  @transient lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    -  val numeric = implicitly[Numeric[Float]]
    -  val fractional = implicitly[Fractional[Float]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Float
    +  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] }
    +  private[sql] val numeric = implicitly[Numeric[Float]]
    +  private[sql] val fractional = implicitly[Fractional[Float]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "float"
    +}
    +
    +object ArrayType {
    +  /** Construct a [[ArrayType]] object with the given element type. The `containsNull` is false. */
    +  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
    +}
    +
    +case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(
    +      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
    +    DataType.buildFormattedString(elementType, s"$prefix    |", builder)
    +  }
    +
    +  def simpleString: String = "array"
     }
     
    -case class ArrayType(elementType: DataType) extends DataType
    +case class StructField(name: String, dataType: DataType, nullable: Boolean) {
    --- End diff --
    
    Add scaladoc to define the semantics of nullable (nullable keys vs nullable values vs both)



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50649902
  
    The Maven build is failing (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/244/console). I am looking at it.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50420253
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17319/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483710
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType"""
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values. Because a float value
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For PySpark, please use L{DoubleType} instead of using L{FloatType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For PySpark, please use L{IntegerType} instead of using L{ByteType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If the any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For PySpark, please use L{IntegerType} instead of using L{ShortType}.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "ShortType"
    +
    +class ArrayType(object):
    +    """Spark SQL ArrayType
    +
    +    The data type representing list values.
    +
    +    """
    +    def __init__(self, elementType, containsNull):
    --- End diff --
    
    Should we have the same default value for containsNull that we have in Scala?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48694858
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16553/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15510935
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in Scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in Scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    I don't get the problem after reading the comment here. Can you clarify?



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48566629
  
    @yhuai, I understand. Thank you for your reply.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48694887
  
    QA results for PR 1346:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class ArrayType(elementType: DataType) extends DataType {
      case class StructField(name: String, dataType: DataType, nullable: Boolean) {
      case class MapType(keyType: DataType, valueType: DataType) extends DataType {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16553/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14754724
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala ---
    @@ -123,9 +123,12 @@ private[sql] trait SchemaRDDLike {
       def saveAsTable(tableName: String): Unit =
         sqlContext.executePlan(InsertIntoCreatedTable(None, tableName, logicalPlan)).toRdd
     
    +  /** Returns the schema. */
    +  def schema: StructType = queryExecution.analyzed.schema
    +
       /** Returns the output schema in the tree format. */
    -  def schemaString: String = queryExecution.analyzed.schemaString
    +  def formattedSchemaString: String = schema.formattedSchemaString
    --- End diff --
    
    Do we have to change this API?



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482765
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/package-info.java ---
    @@ -0,0 +1,22 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +
    +/**
    + * Allows users to get and create Spark SQL data types.
    + */
    +package org.apache.spark.sql.api.java.types;
    --- End diff --
    
    Newline at end of file.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15483749
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    + */
    +public abstract class DataType {
    +
    +  /**
    +   * Gets the StringType object.
    +   */
    +  public static final StringType StringType = new StringType();
    +
    +  /**
    +   * Gets the BinaryType object.
    +   */
    +  public static final BinaryType BinaryType = new BinaryType();
    +
    +  /**
    +   * Gets the BooleanType object.
    +   */
    +  public static final BooleanType BooleanType = new BooleanType();
    +
    +  /**
    +   * Gets the TimestampType object.
    +   */
    +  public static final TimestampType TimestampType = new TimestampType();
    +
    +  /**
    +   * Gets the DecimalType object.
    +   */
    +  public static final DecimalType DecimalType = new DecimalType();
    +
    +  /**
    +   * Gets the DoubleType object.
    +   */
    +  public static final DoubleType DoubleType = new DoubleType();
    +
    +  /**
    +   * Gets the FloatType object.
    +   */
    +  public static final FloatType FloatType = new FloatType();
    +
    +  /**
    +   * Gets the ByteType object.
    +   */
    +  public static final ByteType ByteType = new ByteType();
    +
    +  /**
    +   * Gets the IntegerType object.
    +   */
    +  public static final IntegerType IntegerType = new IntegerType();
    +
    +  /**
    +   * Gets the LongType object.
    +   */
    +  public static final LongType LongType = new LongType();
    +
    +  /**
    +   * Gets the ShortType object.
    +   */
    +  public static final ShortType ShortType = new ShortType();
    +
    +  /**
    +   * Creates an ArrayType by specifying the data type of elements ({@code elementType}) and
    +   * whether the array contains null values ({@code containsNull}).
    +   * @param elementType
    +   * @param containsNull
    +   * @return
    +   */
    +  public static ArrayType createArrayType(DataType elementType, boolean containsNull) {
    --- End diff --
    
    Add another method that has a default for containsNull?



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15554197
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,457 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "TimestampType", "DecimalType",
    +    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType",
    +    "ShortType", "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType
    +
    +    The data type representing datetime.datetime values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "TimestampType"
    +
    +class DecimalType(object):
    +    """Spark SQL DecimalType
    +
    +    The data type representing decimal.Decimal values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DecimalType"
    +
    +class DoubleType(object):
    +    """Spark SQL DoubleType
    +
    +    The data type representing float values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "DoubleType"
    +
    +class FloatType(object):
    +    """Spark SQL FloatType
    +
    +    For now, please use L{DoubleType} instead of using L{FloatType}.
    +    Because query evaluation is done in Scala, java.lang.Double will be used
    +    for Python float numbers. Because the underlying JVM type of FloatType is
    +    java.lang.Float (in Java) and Float (in Scala), there will be a java.lang.ClassCastException
    +    if FloatType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "FloatType"
    +
    +class ByteType(object):
    +    """Spark SQL ByteType
    +
    +    For now, please use L{IntegerType} instead of using L{ByteType}.
    +    Because query evaluation is done in Scala, java.lang.Integer will be used
    +    for Python int numbers. Because the underlying JVM type of ByteType is
    +    java.lang.Byte (in Java) and Byte (in Scala), there will be a java.lang.ClassCastException
    +    if ByteType (Python) is used.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "ByteType"
    +
    +class IntegerType(object):
    +    """Spark SQL IntegerType
    +
    +    The data type representing int values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "IntegerType"
    +
    +class LongType(object):
    +    """Spark SQL LongType
    +
    +    The data type representing long values. If any value is beyond the range of
    +    [-9223372036854775808, 9223372036854775807], please use DecimalType.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def __repr__(self):
    +        return "LongType"
    +
    +class ShortType(object):
    +    """Spark SQL ShortType
    +
    +    For now, please use L{IntegerType} instead of using L{ShortType}.
    --- End diff --
    
    Yes, we should provide convenient methods for users. But we will provide methods for loading CSV files, and we will use a mutable projection to do the type conversions (by using `Cast`).

    Considering the size of this PR and the fact that it is blocking other people's work, it is better to think about this later.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50532980
  
    QA tests have started for PR 1346. This patch DID NOT merge cleanly!
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by chutium <gi...@git.apache.org>.
Github user chutium commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15766362
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be a runtime exception.
    +   * Example:
    +   * {{{
    +   *  import org.apache.spark.sql._
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val schema =
    +   *    StructType(
    +   *      StructField("name", StringType, false) ::
    +   *      StructField("age", IntegerType, true) :: Nil)
    +   *
    --- End diff --
    
    Hi @yhuai, why do we need to define the schema as a StructType rather than directly as a Seq[StructField]? I tried to build a Seq[StructField] from JDBC metadata in #1612 https://github.com/apache/spark/pull/1612/files#diff-3 (it followed the code of your JsonRDD :)

    It seems we do not need this StructType anywhere.




[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48798545
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16582/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50293500
  
    QA results for PR 1346:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17263/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482279
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    --- End diff --
    
    I'd also talk about how this class contains singletons and factory methods for constructing datatypes.
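
    As a rough illustration of what that doc could point at (a sketch only; the exact Javadoc wording is up to the author), the singletons and factory methods are used like this, written here in Scala:
    ```scala
    import org.apache.spark.sql.api.java.types.DataType

    // Primitive types are exposed as singleton fields; complex types are
    // built through the static factory methods.
    val names = DataType.createArrayType(DataType.StringType, false)
    val field = DataType.createStructField("names", names, true)
    ```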



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15493024
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -374,4 +444,97 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd(schema, rowRdd)))
       }
     
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  protected def asJavaStructField(scalaStructField: StructField): JStructField = {
    --- End diff --
    
    Will move them to a better place.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50582008
  
    Thanks for working on this!  Merged to master.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15481293
  
    --- Diff: python/pyspark/sql.py ---
    @@ -20,8 +20,413 @@
     
     from py4j.protocol import Py4JError
     
    -__all__ = ["SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
    +__all__ = [
    +    "StringType", "BinaryType", "BooleanType", "DecimalType", "DoubleType",
    +    "FloatType", "ByteType", "IntegerType", "LongType", "ShortType",
    +    "ArrayType", "MapType", "StructField", "StructType",
    +    "SQLContext", "HiveContext", "LocalHiveContext", "TestHiveContext", "SchemaRDD", "Row"]
     
    +class PrimitiveTypeSingleton(type):
    +    _instances = {}
    +    def __call__(cls):
    +        if cls not in cls._instances:
    +            cls._instances[cls] = super(PrimitiveTypeSingleton, cls).__call__()
    +        return cls._instances[cls]
    +
    +class StringType(object):
    +    """Spark SQL StringType
    +
    +    The data type representing string values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "StringType"
    +
    +class BinaryType(object):
    +    """Spark SQL BinaryType
    +
    +    The data type representing bytes values and bytearray values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BinaryType"
    +
    +class BooleanType(object):
    +    """Spark SQL BooleanType
    +
    +    The data type representing bool values.
    +
    +    """
    +    __metaclass__ = PrimitiveTypeSingleton
    +
    +    def _get_scala_type_string(self):
    +        return "BooleanType"
    +
    +class TimestampType(object):
    +    """Spark SQL TimestampType"""
    --- End diff --
    
    We should also list the Python types that are expected when it's not obvious.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15482888
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +90,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be a runtime exception.
    +   *
    +   * @group userf
    +   */
    +  @DeveloperApi
    +  def applySchema(rowRDD: RDD[Row], schema: StructType): SchemaRDD = {
    +    // TODO: use MutableProjection when rowRDD is another SchemaRDD and the applied
    +    // schema differs from the existing schema on any field data type.
    +    val logicalPlan = SparkLogicalPlan(ExistingRdd(schema.toAttributes, rowRDD))
    +    new SchemaRDD(this, logicalPlan)
    +  }
    +
    +  /**
    +   * Parses the data type in our internal string representation. The data type string should
    +   * have the same format as the one generate by `toString` in scala.
    --- End diff --
    
    Is this only here for pyspark?  Probably should make a note of that.
    
    Nit: "generated"



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904635
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,33 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
     
       /**
    +   * Creates a [[SchemaRDD]] from an [[RDD]] by applying a schema to this RDD and using a function
    +   * that will be applied to each partition of the RDD to convert RDD records to [[Row]]s.
    +   *
    +   * @group userf
    +   */
    +  def applySchema[A](rdd: RDD[A], schema: StructType, f: A => Row): SchemaRDD =
    +    applySchemaToPartitions(rdd, schema, (iter: Iterator[A]) => iter.map(f))
    +
    +  /**
    +   * Creates a [[SchemaRDD]] from an [[RDD]] by applying a schema to this RDD and using a function
    +   * that will be applied to each partition of the RDD to convert RDD records to [[Row]]s.
    --- End diff --
    
    Maybe provide some guidance here on when you'd want to use this function.
    
    > Similar to `RDD.mapPartitions`, this function can be used to improve performance where there is other setup work that can be amortized and used repeatedly for all of the elements in a partition.
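
    A hypothetical illustration of that wording, using the WIP `applySchemaToPartitions` discussed here (the `ExpensiveLineParser` below is a made-up helper, not something in this PR):
    ```scala
    val rowsWithSchema = sqlContext.applySchemaToPartitions(
      textLines,  // an RDD[String], assumed to already exist
      schema,     // a StructType, assumed to already exist
      (iter: Iterator[String]) => {
        // The parser is constructed once per partition and reused for every
        // record in it, so its setup cost is amortized across the partition.
        val parser = new ExpensiveLineParser(schema)
        iter.map(line => parser.parseToRow(line))
      })
    ```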



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by chutium <gi...@git.apache.org>.
Github user chutium commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15799720
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be a runtime exception.
    +   * Example:
    +   * {{{
    +   *  import org.apache.spark.sql._
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val schema =
    +   *    StructType(
    +   *      StructField("name", StringType, false) ::
    +   *      StructField("age", IntegerType, true) :: Nil)
    +   *
    --- End diff --
    
    Oh, yep, StructType is needed. What I mean is that
    ```def applySchema(rowRDD: RDD[Row], schema: StructType): SchemaRDD```
    could be
    ```def applySchema(rowRDD: RDD[Row], schema: Seq[StructField]): SchemaRDD```

    Then we would not need to always write ```schema.fields.map(f => AttributeReference...)```;

    we could directly write ```schema.map(f => AttributeReference...)```.
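
    A minimal sketch of that overload, assuming it simply delegates to the StructType-based method added in this PR (not code from the PR itself):
    ```scala
    // Convenience overload accepting the bare field list.
    def applySchema(rowRDD: RDD[Row], fields: Seq[StructField]): SchemaRDD =
      applySchema(rowRDD, StructType(fields))
    ```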
    





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-50423690
  
    @yhuai @marmbrus I am not sure if this has been discussed before, but what do you guys think about adding a version of `applySchema(RDD[Array[String]], StructType)`? 
    
    The use case I have in mind is TPC-DS data preparation. Currently I have a bunch of text files, from which I can easily create an `RDD[String]`; by splitting each line on some separator I get an `RDD[Array[String]]`. Now, in TPC-DS the tables easily have 15+ columns, and I don't want to manually create a `Row` for each `Array[String]`. 
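
    A rough sketch of what such a helper could look like on top of the `applySchema(RDD[Row], StructType)` added by this PR; the string-to-type coercion below is a simplified assumption, not code from the PR:
    ```scala
    def applyStringSchema(sqlContext: SQLContext,
        rdd: RDD[Array[String]], schema: StructType): SchemaRDD = {
      // Turn each Array[String] into a Row by coercing every column to the
      // type declared for it in the schema, then apply the schema as usual.
      val rowRDD = rdd.map { values =>
        Row(values.zip(schema.fields).map { case (value, field) =>
          field.dataType match {
            case IntegerType => value.trim.toInt
            case LongType    => value.trim.toLong
            case DoubleType  => value.trim.toDouble
            case BooleanType => value.trim.toBoolean
            case _           => value
          }
        }: _*)
      }
      sqlContext.applySchema(rowRDD, schema)
    }
    ```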



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14904071
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -197,47 +213,145 @@ object FractionalType {
       }
     }
     abstract class FractionalType extends NumericType {
    -  val fractional: Fractional[JvmType]
    +  private[sql] val fractional: Fractional[JvmType]
     }
     
     case object DecimalType extends FractionalType {
    -  type JvmType = BigDecimal
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[BigDecimal]]
    -  val fractional = implicitly[Fractional[BigDecimal]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = BigDecimal
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[BigDecimal]]
    +  private[sql] val fractional = implicitly[Fractional[BigDecimal]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "decimal"
     }
     
     case object DoubleType extends FractionalType {
    -  type JvmType = Double
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Double]]
    -  val fractional = implicitly[Fractional[Double]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Double
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Double]]
    +  private[sql] val fractional = implicitly[Fractional[Double]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "double"
     }
     
     case object FloatType extends FractionalType {
    -  type JvmType = Float
    -  @transient lazy val tag = typeTag[JvmType]
    -  val numeric = implicitly[Numeric[Float]]
    -  val fractional = implicitly[Fractional[Float]]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = Float
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val numeric = implicitly[Numeric[Float]]
    +  private[sql] val fractional = implicitly[Fractional[Float]]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "float"
     }
     
    -case class ArrayType(elementType: DataType) extends DataType
    +object ArrayType {
    +  def apply(elementType: DataType): ArrayType = ArrayType(elementType, false)
    +}
     
    -case class StructField(name: String, dataType: DataType, nullable: Boolean)
    +case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType {
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(
    +      s"${prefix}-- element: ${elementType.simpleString} (containsNull = ${containsNull})\n")
    +    elementType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +
    +  def simpleString: String = "array"
    +}
    +
    +case class StructField(name: String, dataType: DataType, nullable: Boolean) {
    +
    +  private[sql] def buildFormattedString(prefix: String, builder: StringBuilder): Unit = {
    +    builder.append(s"${prefix}-- ${name}: ${dataType.simpleString} (nullable = ${nullable})\n")
    +    dataType match {
    +      case array: ArrayType =>
    +        array.buildFormattedString(s"$prefix    |", builder)
    +      case struct: StructType =>
    +        struct.buildFormattedString(s"$prefix    |", builder)
    +      case map: MapType =>
    +        map.buildFormattedString(s"$prefix    |", builder)
    +      case _ =>
    +    }
    +  }
    +}
     
     object StructType {
    -  def fromAttributes(attributes: Seq[Attribute]): StructType = {
    +  def fromAttributes(attributes: Seq[Attribute]): StructType =
         StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable)))
    -  }
     
    -  // def apply(fields: Seq[StructField]) = new StructType(fields.toIndexedSeq)
    +  private def validateFields(fields: Seq[StructField]): Boolean =
    +    fields.map(field => field.name).distinct.size == fields.size
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag](fields: (A, B)*): StructType =
    +    StructType(fields.map(field => StructField(field._1, field._2, true)))
    +
    +  def apply[A <: String: ClassTag, B <: DataType: ClassTag, C <: Boolean: ClassTag](
    +      fields: (A, B, C)*): StructType =
    +    StructType(fields.map(field => StructField(field._1, field._2, field._3)))
     }
     
     case class StructType(fields: Seq[StructField]) extends DataType {
    +  require(StructType.validateFields(fields), "Found fields with the same name.")
    +
    +  def apply(name: String): StructField = {
    --- End diff --
    
    Scaladoc please.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15511498
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala ---
    @@ -0,0 +1,124 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.types.util
    +
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.api.java.types.{DataType => JDataType, StructField => JStructField}
    +
    +import scala.collection.JavaConverters._
    +
    +protected[sql] object DataTypeConversions {
    +
    +  /**
    +   * Returns the equivalent StructField in Scala for the given StructField in Java.
    +   */
    +  def asJavaStructField(scalaStructField: StructField): JStructField = {
    +    org.apache.spark.sql.api.java.types.DataType.createStructField(
    +      scalaStructField.name,
    +      asJavaDataType(scalaStructField.dataType),
    +      scalaStructField.nullable)
    +  }
    +
    +  /**
    +   * Returns the equivalent DataType in Java for the given DataType in Scala.
    +   */
    +  def asJavaDataType(scalaDataType: DataType): JDataType = scalaDataType match {
    +    case StringType =>
    +      org.apache.spark.sql.api.java.types.DataType.StringType
    --- End diff --
    
    Why not just ```JDataType.StringType``` instead of typing out all the fully qualified names?
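
    For what it's worth, with the `JDataType` alias already imported in this file the shorter form would read like this (a partial sketch of the suggestion, not the committed code):
    ```scala
    def asJavaDataType(scalaDataType: DataType): JDataType = scalaDataType match {
      case StringType  => JDataType.StringType
      case BinaryType  => JDataType.BinaryType
      case BooleanType => JDataType.BooleanType
      // ... the remaining primitive and complex types follow the same pattern.
    }
    ```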



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48523579
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48673429
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16528/consoleFull



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15499886
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/types/DataType.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.api.java.types;
    +
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +/**
    + * The base type of all Spark SQL data types.
    --- End diff --
    
    Done.



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-49123530
  
    QA tests have started for PR 1346. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16711/consoleFull



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1346#issuecomment-48673744
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16527/



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1346



[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r14754430
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
    @@ -93,47 +92,56 @@ abstract class DataType {
       }
     
       def isPrimitive: Boolean = false
    +
    +  def simpleString: String
     }
     
    -case object NullType extends DataType
    +case object NullType extends DataType {
    +  def simpleString: String = "null"
    +}
     
     trait PrimitiveType extends DataType {
       override def isPrimitive = true
     }
     
     abstract class NativeType extends DataType {
    -  type JvmType
    -  @transient val tag: TypeTag[JvmType]
    -  val ordering: Ordering[JvmType]
    +  private[sql] type JvmType
    +  @transient private[sql] val tag: TypeTag[JvmType]
    +  private[sql] val ordering: Ordering[JvmType]
     
    -  @transient val classTag = {
    +  @transient private[sql] val classTag = {
         val mirror = runtimeMirror(Utils.getSparkClassLoader)
         ClassTag[JvmType](mirror.runtimeClass(tag.tpe))
       }
     }
     
     case object StringType extends NativeType with PrimitiveType {
    -  type JvmType = String
    -  @transient lazy val tag = typeTag[JvmType]
    -  val ordering = implicitly[Ordering[JvmType]]
    +  private[sql] type JvmType = String
    +  @transient private[sql] lazy val tag = typeTag[JvmType]
    +  private[sql] val ordering = implicitly[Ordering[JvmType]]
    +  def simpleString: String = "string"
     }
    --- End diff --
    
    While you're at it, add a blank line to separate each class.



[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1346#discussion_r15801894
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
     
       /**
    +   * :: DeveloperApi ::
    +   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by applying a schema to this RDD.
    +   * It is important to make sure that the structure of every [[Row]] of the provided RDD matches
    +   * the provided schema. Otherwise, there will be a runtime exception.
    +   * Example:
    +   * {{{
    +   *  import org.apache.spark.sql._
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val schema =
    +   *    StructType(
    +   *      StructField("name", StringType, false) ::
    +   *      StructField("age", IntegerType, true) :: Nil)
    +   *
    --- End diff --
    
    This might be crazy... but if `StructType <: Seq[StructField]`, then we could pass in either a `StructType` or a `Seq[StructField]`. It should be possible to do this fairly easily.
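
    A minimal sketch of that idea, assuming plain delegation to the underlying `fields` (details such as the interaction with the case-class machinery are glossed over here):
    ```scala
    case class StructType(fields: Seq[StructField]) extends DataType with Seq[StructField] {
      // Delegating the Seq contract to fields lets a StructType be passed
      // anywhere a Seq[StructField] is expected.
      override def apply(idx: Int): StructField = fields(idx)
      override def length: Int = fields.length
      override def iterator: Iterator[StructField] = fields.iterator

      def simpleString: String = "struct"
    }
    ```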

