Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144837336
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
Yes, you are right, since we have only two constructors, which enforce this pattern. Then I will edit the exception message according to your suggestion, thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19492
ok to test
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
@HyukjinKwon since there were questions about whether this PR is useful or not, what do you think? Shall we go on with this, or shall I close it? Thanks.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Merged build finished. Test PASSed.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19492
To be honest, just to share what has been on my mind about this so far: I have been trying to weigh whether it's worth the amount of changes. In particular, we now end up with some divergence between `makeRootConverter` and `parseWithArrayOfPrimitiveSupport`. I know why it had to be that way (and I guess @viirya is aware of it too, given the previous discussion in another PR).
I will try to take a few more looks and share my feedback in the near future. Meanwhile, @viirya, WDYT about this PR?
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
kindly ping @viirya @HyukjinKwon
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19492
**[Test build #85516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85516/testReport)** for PR 19492 at commit [`29d6b96`](https://github.com/apache/spark/commit/29d6b96f0d118f2fac3c80b7a5946c83bfca480b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144837080
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +368,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
+ }
+ }
+
+ /**
+ * Parse the JSON input. This function can return a set of [[InternalRow]]s
+ * if a [[StructType]] is defined as schema, otherwise it returns a set of
+ * objects.
--- End diff --
Please add a comment that this is used when passing an `ArrayType` of primitive types.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/19492
@HyukjinKwon
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Merged build finished. Test FAILed.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144783859
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
*/
-class JacksonParser(
- schema: StructType,
+private[sql] class JacksonParser(
+ schema: DataType,
val options: JSONOptions) extends Logging {
import JacksonUtils._
import com.fasterxml.jackson.core.JsonToken._
+ def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+ def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --
These exist to prevent someone from using the constructor with an invalid `DataType`, i.e. anything which is not a `StructType` or an `ArrayType`.
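As a side note, the pattern being discussed can be sketched in a self-contained way with stand-in types (these are not Spark's real classes, and `JsonSchemaParser` is a hypothetical name): making the wide primary constructor private is what actually forces callers through the narrower overloads.

```scala
// Stand-in types: a toy hierarchy, not Spark's real DataType classes.
sealed trait DataType
final case class StructType(fields: Seq[String]) extends DataType
final case class ArrayType(elementType: String) extends DataType
final case class MapType(key: String, value: String) extends DataType

// The primary constructor accepts the wide DataType but is private,
// so outside callers can only build the parser via the overloads below.
class JsonSchemaParser private (val schema: DataType) {
  def this(schema: StructType) = this(schema: DataType)
  def this(schema: ArrayType) = this(schema: DataType)
}

// new JsonSchemaParser(MapType("k", "v"))  // would not compile
```

The type ascription `schema: DataType` in the auxiliary constructors is the upcast that routes both overloads into the single private primary constructor.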
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19492
**[Test build #85509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85509/testReport)** for PR 19492 at commit [`e29226b`](https://github.com/apache/spark/commit/e29226bd04f047e28060616c5db4d9df6cdd3090).
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 closed the pull request at:
https://github.com/apache/spark/pull/19492
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
@viirya have you had any chance to look at this? Thanks.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19492
In fact I don't feel strong preference to support array of primitive types in `from_json`.
The reason is that I think that, most of the time, a JSON object is used at the top level rather than a JSON array. Is it common to use a JSON array to store primitive data like this? It seems to me it has no real advantage over a plain string with a special delimiter such as "," or "|".
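For what it's worth, one thing a delimiter-separated string cannot handle is values that contain the delimiter itself, which a JSON array encodes unambiguously. A tiny illustration (the extraction below is a toy for flat string arrays with assumed well-formed input, not a real JSON parser):

```scala
// A comma-delimited encoding is ambiguous when a value contains a comma.
val delimited = "hello,wo,rld"
val naiveSplit = delimited.split(",").toSeq  // three values; boundaries lost

// A JSON array like ["hello", "wo,rld"] keeps the boundaries explicit.
// Toy extraction for a flat array of quoted strings (well-formed input only).
def jsonStringArray(json: String): Seq[String] =
  json.trim.stripPrefix("[").stripSuffix("]")
    .split("\",\\s*\"").toSeq            // split on the `", "` between items
    .map(_.replaceAll("^\\s*\"|\"\\s*$", ""))  // drop surrounding quotes
```

Here `jsonStringArray("""["hello", "wo,rld"]""")` recovers the two original values, while the delimited form cannot.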
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/377/
Test PASSed.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
@viirya sorry, do you have any more comments?
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144753534
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
timeZoneId = None)
override def checkInputDataTypes(): TypeCheckResult = schema match {
- case _: StructType | ArrayType(_: StructType, _) =>
+ case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
super.checkInputDataTypes()
case _ => TypeCheckResult.TypeCheckFailure(
- s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+ s"Input schema ${schema.simpleString} must be a struct or " +
+ s"an array of structs or primitive types.")
}
@transient
- lazy val rowSchema = schema match {
+ lazy val rowSchema: DataType = schema match {
case st: StructType => st
case ArrayType(st: StructType, _) => st
+ case ArrayType(at: AtomicType, _) => ArrayType(at)
}
// This converts parsed rows to the desired output by the given schema.
@transient
- lazy val converter = schema match {
- case _: StructType =>
- (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
- case ArrayType(_: StructType, _) =>
- (rows: Seq[InternalRow]) => new GenericArrayData(rows)
- }
+ lazy val converter = (rows: Seq[Any]) =>
--- End diff --
This adds extra pattern-matching cost at runtime for every invocation. Can we move the matching outside the closure?
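The cost the comment refers to can be avoided by matching once when the converter is built, so that the returned closure is already specialized. A self-contained sketch with stand-in types (not Spark's real ones):

```scala
sealed trait Schema
case object StructSchema extends Schema
case object ArraySchema extends Schema

// The match here runs once per converter, not once per invocation:
// each branch returns a closure that already knows what to do.
def makeConverter(schema: Schema): Seq[Any] => Any = schema match {
  case StructSchema => rows => if (rows.length == 1) rows.head else null
  case ArraySchema  => rows => rows
}

val structConv = makeConverter(StructSchema)  // specialized up front
```

Contrast this with a single closure that matches on `schema` inside its body, which would re-run the match for every call.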
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144782180
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
/**
* Create a converter which converts the JSON documents held by the `JsonParser`
+ * to a value according to a desired schema. This is an overloaded method to the
+ * previous one which allows to handle array of primitive types.
+ */
+ private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+ (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+ case START_ARRAY =>
+ val array = convertArray(parser, makeConverter(at.elementType))
+ if (array.numElements() == 0) {
+ Nil
+ } else {
+ array.toArray(at.elementType).toSeq
+ }
+ case _ => Nil
--- End diff --
You are right, I will remove this.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85509/
Test FAILed.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144757908
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
Row(Row(1, "haa")) :: Nil)
}
+ test("SPARK-22228: from_json should support also arrays of primitive types") {
+ val dfInt = Seq("[1]", "[2, 3]").toDS()
+ checkAnswer(
+ dfInt.select(from_json($"value", ArrayType(IntegerType))),
+ Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
+
+ val dfString = Seq("""["hello", "world", ""]""").toDS()
+ checkAnswer(
+ dfString.select(from_json($"value", ArrayType(StringType))),
+ Row(Seq("hello", "world", "")):: Nil)
+
+ val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
+ val schema = ArrayType(TimestampType)
+ val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
+
+ checkAnswer(
+ dfTimestamp.select(from_json($"value", schema, options)),
+ Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
+
+ val dfEmpty = Seq("""[]""").toDS()
+ checkAnswer(
+ dfEmpty.select(from_json($"value", ArrayType(StringType))),
+ Row(Nil):: Nil)
--- End diff --
And could we also add a case that can fail the parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
Kindly ping @viirya
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
@viirya The point is that, most of the time, the input format is not decided by the Spark users, and I have seen some input-source providers generating data in this format. It might not be the best, but it is valid. So supporting it can greatly improve the experience of Spark users, who otherwise have to implement their own UDF to do the job.
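To make that workaround concrete: without this support, a user has to hand-roll the parsing and register it as a UDF (e.g. wrapping a function like the one below with Spark's `udf(...)`; the name `parseIntArray` is just illustrative). The core logic might look like this toy parser, which assumes well-formed flat input:

```scala
// Toy hand-rolled parser for a flat JSON array of integers.
// Assumes well-formed input; a real UDF would also need error handling.
def parseIntArray(json: String): Seq[Int] = {
  val body = json.trim.stripPrefix("[").stripSuffix("]").trim
  if (body.isEmpty) Nil
  else body.split(",").map(_.trim.toInt).toSeq
}
```

With the support proposed in this PR, the same result would come from `from_json` with an `ArrayType(IntegerType)` schema directly, with no user-side parsing code.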
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144783193
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
What about a user seeing this exception? With the current message (which I am very open to improving), he/she is aware of trying to do something which is not allowed (at least at the moment); i.e. we might hit this exception when using `sqlContext.read.json(...)` on arrays of primitives. Your suggested message would be a bit confusing to a user: he/she might feel he/she is doing something wrong while trying to achieve something that can actually be done, but of course he/she knows nothing about these internal functions, so he/she would be lost IMHO.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144784037
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
timeZoneId = None)
override def checkInputDataTypes(): TypeCheckResult = schema match {
- case _: StructType | ArrayType(_: StructType, _) =>
+ case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
super.checkInputDataTypes()
case _ => TypeCheckResult.TypeCheckFailure(
- s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+ s"Input schema ${schema.simpleString} must be a struct or " +
+ s"an array of structs or primitive types.")
}
@transient
- lazy val rowSchema = schema match {
+ lazy val rowSchema: DataType = schema match {
--- End diff --
Why is it not a row schema? It is, but sometimes the schema of a row is an array.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144790680
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
timeZoneId = None)
override def checkInputDataTypes(): TypeCheckResult = schema match {
- case _: StructType | ArrayType(_: StructType, _) =>
+ case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
super.checkInputDataTypes()
case _ => TypeCheckResult.TypeCheckFailure(
- s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+ s"Input schema ${schema.simpleString} must be a struct or " +
+ s"an array of structs or primitive types.")
}
@transient
- lazy val rowSchema = schema match {
+ lazy val rowSchema: DataType = schema match {
--- End diff --
I think it was called `rowSchema` because it could only be a `StructType` before. This is not the input/output row's schema.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144789703
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
Yes, it is used internally, but if it throws an `Exception`, a user might see it. I think that if this is not clear for internal usage we can add comments, but the text of the exception should be meaningful to the end user IMHO. If you have any suggestion on how to improve this message while keeping it meaningful to a user, I am happy to change it. Thanks.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144754775
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
/**
* Create a converter which converts the JSON documents held by the `JsonParser`
+ * to a value according to a desired schema. This is an overloaded method to the
+ * previous one which allows to handle array of primitive types.
+ */
+ private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+ (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+ case START_ARRAY =>
+ val array = convertArray(parser, makeConverter(at.elementType))
--- End diff --
Move `makeConverter` outside the inner function so we can reuse it.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144754134
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
*/
-class JacksonParser(
- schema: StructType,
+private[sql] class JacksonParser(
+ schema: DataType,
val options: JSONOptions) extends Logging {
import JacksonUtils._
import com.fasterxml.jackson.core.JsonToken._
+ def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+ def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --
Are those necessary?
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144785808
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
What I thought is that `JacksonParser` is used internally in Spark SQL. It is hard to imagine an end user calling `parse` directly and seeing this exception.
Actually, `parse` is supposed to return `InternalRow`s; the only case where we get something else is when the schema given to `JacksonParser` is wrong. So I expect this exception to be seen only internally during SQL development.
Btw, I have no strong opinion at this point. If you think it is ok, I'm fine with it.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19492
I am actually still hesitant. FWIW, there's another PR for this if I am not mistaken: https://github.com/apache/spark/pull/21439. I don't quite like that approach either, but at least it is a smaller change. This can of course be useful, but I am hesitant about the approach and the amount of changes.
Since I don't have a better idea, one of them could be merged by someone else. I don't object.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144783459
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
Row(Row(1, "haa")) :: Nil)
}
+ test("SPARK-22228: from_json should support also arrays of primitive types") {
+ val dfInt = Seq("[1]", "[2, 3]").toDS()
--- End diff --
This case is tested a few lines below. I preferred to treat each specific case separately, so that a failure points out very easily which case is not handled properly.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144783499
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
Row(Row(1, "haa")) :: Nil)
}
+ test("SPARK-22228: from_json should support also arrays of primitive types") {
+ val dfInt = Seq("[1]", "[2, 3]").toDS()
+ checkAnswer(
+ dfInt.select(from_json($"value", ArrayType(IntegerType))),
+ Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
+
+ val dfString = Seq("""["hello", "world", ""]""").toDS()
+ checkAnswer(
+ dfString.select(from_json($"value", ArrayType(StringType))),
+ Row(Seq("hello", "world", "")):: Nil)
+
+ val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
+ val schema = ArrayType(TimestampType)
+ val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
+
+ checkAnswer(
+ dfTimestamp.select(from_json($"value", schema, options)),
+ Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))))
+
+ val dfEmpty = Seq("""[]""").toDS()
+ checkAnswer(
+ dfEmpty.select(from_json($"value", ArrayType(StringType))),
+ Row(Nil):: Nil)
--- End diff --
I will, thanks for the suggestion.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
any more comments on this @viirya @HyukjinKwon?
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19492
**[Test build #85507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85507/testReport)** for PR 19492 at commit [`7e03f20`](https://github.com/apache/spark/commit/7e03f207c0e726264e257a6a0b0568ef55b3ca66).
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Build finished. Test PASSed.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144753618
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
timeZoneId = None)
override def checkInputDataTypes(): TypeCheckResult = schema match {
- case _: StructType | ArrayType(_: StructType, _) =>
+ case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
super.checkInputDataTypes()
case _ => TypeCheckResult.TypeCheckFailure(
- s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+ s"Input schema ${schema.simpleString} must be a struct or " +
+ s"an array of structs or primitive types.")
}
@transient
- lazy val rowSchema = schema match {
+ lazy val rowSchema: DataType = schema match {
--- End diff --
It's not a schema for rows anymore. Maybe rename it to `dataSchema`?
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19492
Closing as we have a newer PR (https://github.com/apache/spark/pull/21439) which uses the refactored classes (which made the change easier compared to this PR). Thanks.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144754632
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
*/
-class JacksonParser(
- schema: StructType,
+private[sql] class JacksonParser(
+ schema: DataType,
val options: JSONOptions) extends Logging {
import JacksonUtils._
import com.fasterxml.jackson.core.JsonToken._
+ def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+ def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
+
// A `ValueConverter` is responsible for converting a value from `JsonParser`
// to a value in a field for `InternalRow`.
private type ValueConverter = JsonParser => AnyRef
// `ValueConverter`s for the root schema for all fields in the schema
- private val rootConverter = makeRootConverter(schema)
+ private val rootConverter = schema match {
+ case s: StructType => makeRootConverter(s)
--- End diff --
It is kind of easy to get confused. Please add a comment to each case, like:
```scala
private val rootConverter = schema match {
case s: StructType => makeRootConverter(s) // For struct or array of struct.
case a: ArrayType => makeRootConverter(a) // For array of primitive types.
}
```
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85516/
Test PASSed.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144757046
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
This exception looks a bit weird. How about `` `parse` is only used to parse the JSON input to the set of `InternalRow`s. Use `parseWithArrayOfPrimitiveSupport` when parsing an array of primitive data is needed``?
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19492
**[Test build #85507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85507/testReport)** for PR 19492 at commit [`7e03f20`](https://github.com/apache/spark/commit/7e03f207c0e726264e257a6a0b0568ef55b3ca66).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144836485
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
*/
-class JacksonParser(
- schema: StructType,
+private[sql] class JacksonParser(
+ schema: DataType,
val options: JSONOptions) extends Logging {
import JacksonUtils._
import com.fasterxml.jackson.core.JsonToken._
+ def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+ def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --
you are right, I am fixing it, thanks.
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144837088
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --
Anyway, we can keep it as it is. I didn't feel strongly to change it.
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19492
I will be busy relocating in the next few days. I'll look into this further if I have time, maybe over the weekend.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144755348
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
/**
* Create a converter which converts the JSON documents held by the `JsonParser`
+ * to a value according to a desired schema. This is an overloaded method to the
+ * previous one which allows to handle array of primitive types.
+ */
+ private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+ (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+ case START_ARRAY =>
+ val array = convertArray(parser, makeConverter(at.elementType))
+ if (array.numElements() == 0) {
+ Nil
+ } else {
+ array.toArray(at.elementType).toSeq
+ }
+ case _ => Nil
--- End diff --
Should we return `Nil` when the input is not parsed to an array? The original `makeRootConverter` didn't do this.
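As a rough sketch of the alternative (names such as `parseJsonToken`, `convertArray`, and `makeConverter` are taken from the diff above; the behavior on fall-through is an assumption about how the shared error path works), the catch-all `case _ => Nil` could simply be dropped so that a non-array token goes through the same malformed-record handling as the struct case:

```scala
// Hypothetical variant: no catch-all case, so a token other than START_ARRAY
// is left to parseJsonToken's own fall-through handling (e.g. raising a
// conversion failure) instead of being silently swallowed as Nil.
private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
  (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
    case START_ARRAY =>
      val array = convertArray(parser, makeConverter(at.elementType))
      if (array.numElements() == 0) Nil
      else array.toArray(at.elementType).toSeq
  }
}
```

Whether failing is preferable to `Nil` depends on how malformed records are meant to surface under the parser's mode options (e.g. PERMISSIVE vs FAILFAST).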
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19492
**[Test build #85516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85516/testReport)** for PR 19492 at commit [`29d6b96`](https://github.com/apache/spark/commit/29d6b96f0d118f2fac3c80b7a5946c83bfca480b).
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144754004
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
--- End diff --
After this change, it doesn't always return an `InternalRow`.
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144834116
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
/**
* Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
*/
-class JacksonParser(
- schema: StructType,
+private[sql] class JacksonParser(
+ schema: DataType,
val options: JSONOptions) extends Logging {
import JacksonUtils._
import com.fasterxml.jackson.core.JsonToken._
+ def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+ def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --
If so, then I think the primary constructor should be private?
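For reference, a minimal sketch of that pattern (class body elided, method bodies omitted): making the primary constructor private forces every caller through the typed overloads, so the `DataType` parameter can never be anything other than a `StructType` or an `ArrayType`:

```scala
// Sketch only: private primary constructor with public typed overloads.
class JacksonParser private (
    schema: DataType,
    val options: JSONOptions) extends Logging {

  // Only these two entry points are visible outside the class.
  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)

  // ... rest of the parser ...
}
```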
---
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19492#discussion_r144837359
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +368,25 @@ class JacksonParser(
record: T,
createParser: (JsonFactory, T) => JsonParser,
recordLiteral: T => UTF8String): Seq[InternalRow] = {
+ parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+ case rows: Seq[InternalRow] => rows
+ case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+ new RuntimeException("Conversion of array of primitive data is not yet supported here."))
+ }
+ }
+
+ /**
+ * Parse the JSON input. This function can return a set of [[InternalRow]]s
+ * if a [[StructType]] is defined as schema, otherwise it returns a set of
+ * objects.
--- End diff --
Btw, also add a comment to the existing `parse` to clarify its usage. It might not be easy to know which one to call at first glance.
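Something along these lines, for instance (a sketch only; the wording is illustrative and the signature is copied from the diff above):

```scala
/**
 * Parses the JSON input into a set of [[InternalRow]]s. This is only valid
 * when the parser was constructed with a [[StructType]] (or array-of-struct)
 * schema; for arrays of primitive types, use
 * `parseWithArrayOfPrimitiveSupport` instead.
 */
def parse[T](
    record: T,
    createParser: (JsonFactory, T) => JsonParser,
    recordLiteral: T => UTF8String): Seq[InternalRow] = {
  // ...
}
```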
---
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19492
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85507/
Test FAILed.
---