Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2017/10/13 12:43:47 UTC

[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19492

    [SPARK-22228][SQL] Add support for array<primitive_type> to from_json

    ## What changes were proposed in this pull request?
    
    The fix introduces support for parsing arrays of primitive types in the `from_json` function. Currently it supports only arrays of structs, which correspond to arrays of JSON objects, but it doesn't allow parsing arrays of simple types, such as arrays of strings (e.g. `["this is a valid JSON", "we cannot parse before the PR"]`).
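
    For illustration, a minimal sketch of the behavior this enables (assuming a `SparkSession` with its implicits in scope; the column name `value` comes from `toDS()`):

    ```scala
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{ArrayType, StringType}

    val df = Seq("""["this is a valid JSON", "we cannot parse before the PR"]""").toDS()
    // Without the fix: AnalysisException, since only struct and array-of-struct
    // schemas are accepted. With the fix: a column of array<string>.
    df.select(from_json($"value", ArrayType(StringType))).show(false)
    ```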
    
    ## How was this patch tested?
    
    Added UTs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22228

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19492.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19492
    
----
commit cba1f4accaf5dbe305e366eab33879577d7f9a93
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-10-11T10:29:26Z

    [SPARK-22228] Add support for array<primitive_type> to from_json

----


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @HyukjinKwon I am not sure that the amount of change outweighs fixing a misbehavior, since right now we refuse to parse valid JSON. Moreover, I think most of the complexity introduced is there to ensure that we don't change behavior anywhere else, namely that we don't affect the JSON datasource. If and when we extend support for arrays of primitives there as well, most of the divergence can be removed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144792285
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -536,26 +536,31 @@ case class JsonToStructs(
           timeZoneId = None)
     
       override def checkInputDataTypes(): TypeCheckResult = schema match {
    -    case _: StructType | ArrayType(_: StructType, _) =>
    +    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
           super.checkInputDataTypes()
         case _ => TypeCheckResult.TypeCheckFailure(
    -      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
    +      s"Input schema ${schema.simpleString} must be a struct or " +
    +        s"an array of structs or primitive types.")
       }
     
       @transient
    -  lazy val rowSchema = schema match {
    +  lazy val rowSchema: DataType = schema match {
    --- End diff --
    
    I don't see the difference. Anyway, if you think `dataSchema` is more appropriate, I can rename it.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144757530
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
           Row(Row(1, "haa")) :: Nil)
       }
     
    +  test("SPARK-22228: from_json should support also arrays of primitive types") {
    +    val dfInt = Seq("[1]", "[2, 3]").toDS()
    --- End diff --
    
    Add a `[]` to the data?


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85509/testReport)** for PR 19492 at commit [`e29226b`](https://github.com/apache/spark/commit/e29226bd04f047e28060616c5db4d9df6cdd3090).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144835433
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    To clarify: I think the only way to hit this exception is to pass an `ArrayType` into the `JacksonParser` constructor and then call `parse` instead of `parseWithArrayOfPrimitiveSupport`. Because `JacksonParser` is used internally, such a usage would have to be intentional, and the developer would get the exception right away.

    So I don't think this exception can ever be seen by an end user, unless we ship such broken code in a Spark release.



---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144837336
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    Yes, you are right, since we have only two constructors, which enforce this pattern. I will edit the exception message according to your suggestion, thanks.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    ok to test


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @HyukjinKwon as there were questions about whether this PR is useful or not, what do you think? Shall we go on with it, or shall I close it? Thanks.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    To be honest, just to share what has been on my mind so far: I have been trying to decide whether this is worth the amount of change; in particular, we now have some divergence, `makeRootConverter` / `parseWithArrayOfPrimitiveSupport`. I know why they had to be that way (and I guess @viirya is aware of it too, given the previous discussion in another PR).

    I will take a few more looks and share my feedback in the near future. Meanwhile, @viirya WDYT on this PR?


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Kindly ping @viirya @HyukjinKwon


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85516/testReport)** for PR 19492 at commit [`29d6b96`](https://github.com/apache/spark/commit/29d6b96f0d118f2fac3c80b7a5946c83bfca480b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144837080
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +368,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    +    }
    +  }
    +
    +  /**
    +   * Parse the JSON input. This function can return a set of [[InternalRow]]s
    +   * if a [[StructType]] is defined as schema, otherwise it returns a set of
    +   * objects.
    --- End diff --
    
    Please add a comment that this is used when passing an `ArrayType` of primitive types.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @HyukjinKwon 


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144783859
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
      */
    -class JacksonParser(
    -    schema: StructType,
    +private[sql] class JacksonParser(
    +    schema: DataType,
         val options: JSONOptions) extends Logging {
     
       import JacksonUtils._
       import com.fasterxml.jackson.core.JsonToken._
     
    +  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
    +  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
    --- End diff --
    
    These are to prevent someone from using the constructor with an invalid `DataType`, i.e. anything which is not a `StructType` or an `ArrayType`.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85509/testReport)** for PR 19492 at commit [`e29226b`](https://github.com/apache/spark/commit/e29226bd04f047e28060616c5db4d9df6cdd3090).


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 closed the pull request at:

    https://github.com/apache/spark/pull/19492


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @viirya did you have a chance to look at this? Thanks.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    In fact I don't have a strong preference for supporting arrays of primitive types in `from_json`.

    The reason is that I think most of the time we use a JSON object rather than a JSON array at the top level. Is it common to use a JSON array to store primitive data like this? It seems to me it has no advantage over a plain string with a special delimiter such as "," or "|", which the sketch below handles with a simple `split`.
    
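    A minimal sketch of that delimiter-based alternative (assuming a `SparkSession` with its implicits in scope; the sample data and column name are illustrative):

    ```scala
    import org.apache.spark.sql.functions.split

    val df = Seq("a,b,c").toDF("value")
    // Produces a column of array<string> without any JSON parsing.
    df.select(split($"value", ",")).show(false)
    ```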



---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/377/
    Test PASSed.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @viirya sorry, do you have any more comments?


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144753534
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -536,26 +536,31 @@ case class JsonToStructs(
           timeZoneId = None)
     
       override def checkInputDataTypes(): TypeCheckResult = schema match {
    -    case _: StructType | ArrayType(_: StructType, _) =>
    +    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
           super.checkInputDataTypes()
         case _ => TypeCheckResult.TypeCheckFailure(
    -      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
    +      s"Input schema ${schema.simpleString} must be a struct or " +
    +        s"an array of structs or primitive types.")
       }
     
       @transient
    -  lazy val rowSchema = schema match {
    +  lazy val rowSchema: DataType = schema match {
         case st: StructType => st
         case ArrayType(st: StructType, _) => st
    +    case ArrayType(at: AtomicType, _) => ArrayType(at)
       }
     
       // This converts parsed rows to the desired output by the given schema.
       @transient
    -  lazy val converter = schema match {
    -    case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    -    case ArrayType(_: StructType, _) =>
    -      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
    -  }
    +  lazy val converter = (rows: Seq[Any]) =>
    --- End diff --
    
    This brings extra matching cost at runtime. Can we move the matching outside, as in the sketch below?
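
    A self-contained sketch of the suggestion (`buildConverter` and the sample types are illustrative stand-ins for the converter in the diff): the schema is matched once while building the converter, so each invocation runs a pre-selected function.

    ```scala
    import org.apache.spark.sql.catalyst.util.GenericArrayData
    import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, StructType}

    def buildConverter(schema: DataType): Seq[Any] => Any = schema match {
      // Matched once here, not on every invocation:
      case _: StructType => rows => if (rows.length == 1) rows.head else null
      case _: ArrayType  => rows => new GenericArrayData(rows)
    }

    val converter = buildConverter(ArrayType(IntegerType))
    converter(Seq(1, 2, 3))  // no per-call pattern matching
    ```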


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144782180
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -89,6 +95,24 @@ class JacksonParser(
     
       /**
        * Create a converter which converts the JSON documents held by the `JsonParser`
    +   * to a value according to a desired schema. This is an overloaded method to the
    +   * previous one which allows to handle array of primitive types.
    +   */
    +  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
    +    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
    +      case START_ARRAY =>
    +        val array = convertArray(parser, makeConverter(at.elementType))
    +        if (array.numElements() == 0) {
    +          Nil
    +        } else {
    +          array.toArray(at.elementType).toSeq
    +        }
    +      case _ => Nil
    --- End diff --
    
    You are right, I will remove this.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85509/
    Test FAILed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144757908
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
           Row(Row(1, "haa")) :: Nil)
       }
     
    +  test("SPARK-22228: from_json should support also arrays of primitive types") {
    +    val dfInt = Seq("[1]", "[2, 3]").toDS()
    +    checkAnswer(
    +      dfInt.select(from_json($"value", ArrayType(IntegerType))),
    +      Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
    +
    +    val dfString = Seq("""["hello", "world", ""]""").toDS()
    +    checkAnswer(
    +      dfString.select(from_json($"value", ArrayType(StringType))),
    +      Row(Seq("hello", "world", "")):: Nil)
    +
    +    val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
    +    val schema = ArrayType(TimestampType)
    +    val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
    +
    +    checkAnswer(
    +      dfTimestamp.select(from_json($"value", schema, options)),
    +      Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
    +
    +    val dfEmpty = Seq("""[]""").toDS()
    +    checkAnswer(
    +      dfEmpty.select(from_json($"value", ArrayType(StringType))),
    +      Row(Nil):: Nil)
    --- End diff --
    
    And also add a case that fails parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?
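
    A sketch of such a negative case, in the style of the tests above (the expected rows are an assumption: a malformed record yields `null` under the default permissive parsing):

    ```scala
    val dfBad = Seq("[1]", "[2, 3]", "[string]").toDS()
    checkAnswer(
      dfBad.select(from_json($"value", ArrayType(IntegerType))),
      // "[string]" is not valid JSON, so its row is expected to be null.
      Row(Seq(1)) :: Row(Seq(2, 3)) :: Row(null) :: Nil)
    ```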


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Kindly ping @viirya


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    @viirya The point is that most of the time the input format is not decided by Spark users, and I have seen some providers of input sources generating data in this format. It might not be the best, but it is valid. So supporting it can greatly improve the experience of Spark users, who otherwise have to implement their own UDF to do the job.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144783193
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    What about a user seeing this exception? With the current message (which I am very open to improving), they know they are trying to do something which is not allowed (at least at the moment); i.e. we might hit this exception when using `sqlContext.read.json(...)` on arrays of primitives. Your suggested message would be a bit weird for a user: it would make them feel they are doing something wrong to achieve something which can be done, but of course they know nothing about these internal functions, so they would be lost IMHO.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144784037
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -536,26 +536,31 @@ case class JsonToStructs(
           timeZoneId = None)
     
       override def checkInputDataTypes(): TypeCheckResult = schema match {
    -    case _: StructType | ArrayType(_: StructType, _) =>
    +    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
           super.checkInputDataTypes()
         case _ => TypeCheckResult.TypeCheckFailure(
    -      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
    +      s"Input schema ${schema.simpleString} must be a struct or " +
    +        s"an array of structs or primitive types.")
       }
     
       @transient
    -  lazy val rowSchema = schema match {
    +  lazy val rowSchema: DataType = schema match {
    --- End diff --
    
    Why is it not a row schema? It is, but sometimes the schema of a row is an array.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144790680
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -536,26 +536,31 @@ case class JsonToStructs(
           timeZoneId = None)
     
       override def checkInputDataTypes(): TypeCheckResult = schema match {
    -    case _: StructType | ArrayType(_: StructType, _) =>
    +    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
           super.checkInputDataTypes()
         case _ => TypeCheckResult.TypeCheckFailure(
    -      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
    +      s"Input schema ${schema.simpleString} must be a struct or " +
    +        s"an array of structs or primitive types.")
       }
     
       @transient
    -  lazy val rowSchema = schema match {
    +  lazy val rowSchema: DataType = schema match {
    --- End diff --
    
    I think it was named `rowSchema` because it could only be a `StructType` before. Now it is not the schema of the input/output rows anymore.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144789703
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    Yes, it is used internally, but if it throws an `Exception`, a user might see it. I think that if this is not clear for internal usage we can add comments, but the text of the exception should be meaningful to the end user IMHO. If you have any suggestion about how to improve this message while keeping it meaningful to a user, I am happy to change it. Thanks.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144754775
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -89,6 +95,24 @@ class JacksonParser(
     
       /**
        * Create a converter which converts the JSON documents held by the `JsonParser`
    +   * to a value according to a desired schema. This is an overloaded method to the
    +   * previous one which allows to handle array of primitive types.
    +   */
    +  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
    +    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
    +      case START_ARRAY =>
    +        val array = convertArray(parser, makeConverter(at.elementType))
    --- End diff --
    
    Move `makeConverter` outside the inner function so we can reuse it; see the sketch below.
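
    A sketch of the suggested hoisting, reusing the Spark-internal names from the diff above (a fragment of `JacksonParser`, not a standalone snippet): the element converter is built once per parser rather than once per record.

    ```scala
    private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
      val elementConverter = makeConverter(at.elementType)  // built once, reused
      (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
        case START_ARRAY =>
          val array = convertArray(parser, elementConverter)
          if (array.numElements() == 0) Nil
          else array.toArray(at.elementType).toSeq
      }
    }
    ```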


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144754134
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
      */
    -class JacksonParser(
    -    schema: StructType,
    +private[sql] class JacksonParser(
    +    schema: DataType,
         val options: JSONOptions) extends Logging {
     
       import JacksonUtils._
       import com.fasterxml.jackson.core.JsonToken._
     
    +  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
    +  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
    --- End diff --
    
    Are those necessary?


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144785808
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    What I thought is that `JacksonParser` is used internally in Spark SQL. It is hard to imagine an end user directly calling `parse` and seeing this exception.

    Actually, `parse` is supposed to return `InternalRow`s. The only case where we get something else is when the schema given to `JacksonParser` is wrong. So I expect this exception to be seen only during SQL development internally.

    Btw, I have no strong opinion at this point. If you think it is OK, I'm fine with it.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    I am actually still hesitant. FWIW, there's another PR for this if I am not mistaken: `https://github.com/apache/spark/pull/21439`. I don't quite like that approach either, but at least it is a smaller change. This can of course be useful, but I am hesitant about the approach and the amount of change.

    Since I don't have a better idea, one of them could be merged by someone else. I don't object.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144783459
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
           Row(Row(1, "haa")) :: Nil)
       }
     
    +  test("SPARK-22228: from_json should support also arrays of primitive types") {
    +    val dfInt = Seq("[1]", "[2, 3]").toDS()
    --- End diff --
    
    This case is tested a few lines below. I preferred to treat each specific case separately, so that a failure points out very easily which case is not handled properly.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144783499
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
           Row(Row(1, "haa")) :: Nil)
       }
     
    +  test("SPARK-22228: from_json should support also arrays of primitive types") {
    +    val dfInt = Seq("[1]", "[2, 3]").toDS()
    +    checkAnswer(
    +      dfInt.select(from_json($"value", ArrayType(IntegerType))),
    +      Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
    +
    +    val dfString = Seq("""["hello", "world", ""]""").toDS()
    +    checkAnswer(
    +      dfString.select(from_json($"value", ArrayType(StringType))),
    +      Row(Seq("hello", "world", "")):: Nil)
    +
    +    val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
    +    val schema = ArrayType(TimestampType)
    +    val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
    +
    +    checkAnswer(
    +      dfTimestamp.select(from_json($"value", schema, options)),
    +      Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
    +
    +    val dfEmpty = Seq("""[]""").toDS()
    +    checkAnswer(
    +      dfEmpty.select(from_json($"value", ArrayType(StringType))),
    +      Row(Nil):: Nil)
    --- End diff --
    
    I will, thanks for the suggestion.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Any more comments on this, @viirya @HyukjinKwon?


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85507/testReport)** for PR 19492 at commit [`7e03f20`](https://github.com/apache/spark/commit/7e03f207c0e726264e257a6a0b0568ef55b3ca66).


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Build finished. Test PASSed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144753618
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -536,26 +536,31 @@ case class JsonToStructs(
           timeZoneId = None)
     
       override def checkInputDataTypes(): TypeCheckResult = schema match {
    -    case _: StructType | ArrayType(_: StructType, _) =>
    +    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
           super.checkInputDataTypes()
         case _ => TypeCheckResult.TypeCheckFailure(
    -      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
    +      s"Input schema ${schema.simpleString} must be a struct or " +
    +        s"an array of structs or primitive types.")
       }
     
       @transient
    -  lazy val rowSchema = schema match {
    +  lazy val rowSchema: DataType = schema match {
    --- End diff --
    
    Not a schema for a row anymore. Maybe `dataSchema`?


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Closing, as we have a newer PR (https://github.com/apache/spark/pull/21439) which uses the refactored classes (making the change easier than in this PR). Thanks.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144754632
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
      */
    -class JacksonParser(
    -    schema: StructType,
    +private[sql] class JacksonParser(
    +    schema: DataType,
         val options: JSONOptions) extends Logging {
     
       import JacksonUtils._
       import com.fasterxml.jackson.core.JsonToken._
     
    +  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
    +  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
    +
       // A `ValueConverter` is responsible for converting a value from `JsonParser`
       // to a value in a field for `InternalRow`.
       private type ValueConverter = JsonParser => AnyRef
     
       // `ValueConverter`s for the root schema for all fields in the schema
    -  private val rootConverter = makeRootConverter(schema)
    +  private val rootConverter = schema match {
    +    case s: StructType => makeRootConverter(s)
    --- End diff --
    
    It is kind of easy to get confused. Please add a comment to each case, like:
    ```scala
    private val rootConverter = schema match {
        case s: StructType => makeRootConverter(s) // For struct or array of struct.
        case a: ArrayType => makeRootConverter(a)  // For array of primitive types.
    }
    ```


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85516/
    Test PASSed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144757046
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    This exception looks a bit weird. How about `` `parse` is only used to parse the JSON input to a set of `InternalRow`s. Use `parseWithArrayOfPrimitiveSupport` when parsing arrays of primitive data is needed``?


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85507/testReport)** for PR 19492 at commit [`7e03f20`](https://github.com/apache/spark/commit/7e03f207c0e726264e257a6a0b0568ef55b3ca66).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144836485
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
      */
    -class JacksonParser(
    -    schema: StructType,
    +private[sql] class JacksonParser(
    +    schema: DataType,
         val options: JSONOptions) extends Logging {
     
       import JacksonUtils._
       import com.fasterxml.jackson.core.JsonToken._
     
    +  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
    +  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
    --- End diff --
    
    You are right, I am fixing it, thanks.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144837088
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +367,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    --- End diff --
    
    Anyway, we can keep it as it is. I don't feel strongly about changing it.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    I will be busy relocating over the next few days. I'll look into this further if I have time, maybe on the weekend.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144755348
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -89,6 +95,24 @@ class JacksonParser(
     
       /**
        * Create a converter which converts the JSON documents held by the `JsonParser`
    +   * to a value according to a desired schema. This is an overloaded method to the
    +   * previous one which allows to handle array of primitive types.
    +   */
    +  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
    +    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
    +      case START_ARRAY =>
    +        val array = convertArray(parser, makeConverter(at.elementType))
    +        if (array.numElements() == 0) {
    +          Nil
    +        } else {
    +          array.toArray(at.elementType).toSeq
    +        }
    +      case _ => Nil
    --- End diff --
    
    Should we return `Nil` when the input doesn't parse to an array? The original `makeRootConverter` didn't do this.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    **[Test build #85516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85516/testReport)** for PR 19492 at commit [`29d6b96`](https://github.com/apache/spark/commit/29d6b96f0d118f2fac3c80b7a5946c83bfca480b).


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144754004
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
    --- End diff --
    
    After this change, it doesn't always return an `InternalRow`.


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144834116
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
     /**
      * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
      */
    -class JacksonParser(
    -    schema: StructType,
    +private[sql] class JacksonParser(
    +    schema: DataType,
         val options: JSONOptions) extends Logging {
     
       import JacksonUtils._
       import com.fasterxml.jackson.core.JsonToken._
     
    +  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
    +  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
    --- End diff --
    
    If so, then I think the primary constructor should be private? See the sketch below.
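
    A self-contained sketch of that pattern (`SafeParser` is an illustrative stand-in, not the actual `JacksonParser`):

    ```scala
    import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

    // The primary constructor is private, so callers must go through the
    // auxiliary constructors, which accept only the two valid schema types.
    class SafeParser private (schema: DataType) {
      def this(schema: StructType) = this(schema: DataType)
      def this(schema: ArrayType) = this(schema: DataType)
    }

    new SafeParser(new StructType())  // compiles
    // new SafeParser(org.apache.spark.sql.types.IntegerType)  // does not compile: primary is private
    ```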


---



[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19492#discussion_r144837359
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -343,6 +368,25 @@ class JacksonParser(
           record: T,
           createParser: (JsonFactory, T) => JsonParser,
           recordLiteral: T => UTF8String): Seq[InternalRow] = {
    +    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
    +      case rows: Seq[InternalRow] => rows
    +      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
    +        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
    +    }
    +  }
    +
    +  /**
    +   * Parse the JSON input. This function can return a set of [[InternalRow]]s
    +   * if a [[StructType]] is defined as schema, otherwise it returns a set of
    +   * objects.
    --- End diff --
    
    Btw, also add a comment to the existing `parse` to clarify its usage. It might not be easy to know which one to call at first glance.


---



[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19492
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85507/
    Test FAILed.


---
