You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2017/10/11 03:54:01 UTC
[GitHub] spark pull request #19470: [SPARK-18355][SQL] Use Spark schema to read ORC t...
GitHub user dongjoon-hyun opened a pull request:
https://github.com/apache/spark/pull/19470
[SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema
## What changes were proposed in this pull request?
Before Hive 2.0, ORC File schema has invalid column names like `_col1` and `col2`. This PR ignores ORC File schema and use Spark schema.
## How was this patch tested?
Pass the newly added test case.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dongjoon-hyun/spark SPARK-18355
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19470.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19470
----
commit d11ce09d07f36f39c4a8a92c4f0ff0f19a35a724
Author: Dongjoon Hyun <do...@apache.org>
Date: 2017-10-11T02:12:31Z
[SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144465575
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --
since it depends on the CONVERT_METASTORE_XXX conf, maybe also test parquet.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144466637
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --
Yep. I'll add `parquet`, too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-18355][SQL] Use Spark schema to read ORC table in...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82620/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82701/testReport)** for PR 19470 at commit [`8ac1acf`](https://github.com/apache/spark/commit/8ac1acfad74eba27a7123e04ca14682bff59a20b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82701/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144387023
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+ withTempDatabase { db =>
+ client.runSqlHive(
+ s"""
+ |CREATE TABLE $db.t(
+ | click_id string,
+ | search_id string,
+ | uid bigint)
+ |PARTITIONED BY (
+ | ts string,
+ | hour string)
+ |STORED AS ORC
+ """.stripMargin)
+
+ client.runSqlHive(
+ s"""
+ |INSERT INTO TABLE $db.t
+ |PARTITION (ts = '98765', hour = '01')
+ |VALUES (12, 2, 12345)
+ """.stripMargin
+ )
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
+ Row("12", "2", 12345, "98765", "01"))
+
+ client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+ checkAnswer(
+ sql(s"SELECT click_id, search_id FROM $db.t"),
+ Row("12", "2"))
+
+ checkAnswer(
+ sql(s"SELECT search_id, click_id FROM $db.t"),
+ Row("2", "12"))
+
+ checkAnswer(
+ sql(s"SELECT search_id FROM $db.t"),
+ Row("2"))
+
+ checkAnswer(
+ sql(s"SELECT dummy, click_id FROM $db.t"),
+ Row(null, "12"))
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
+ Row("12", "2", 12345, null, "98765", "01"))
+ }
+ }
+ }
+ }
+
+ // This test case is added to prevent regression.
+ test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
--- End diff --
This is added to prevent regression according to your request, @gatorsmile ~
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Hi, @gatorsmile and @cloud-fan .
Could you review this PR?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457235
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
if (maybePhysicalSchema.isEmpty) {
--- End diff --
nit
```
val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
if (isEmptyFile) {
...
} else ...
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Hi, @gatorsmile .
- **For mismatched column orders**: [SPARK-22267](https://issues.apache.org/jira/browse/SPARK-22267) shows the current status of ORC.
- **For mismatched column types**: Parquet also does not support.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Ya, that was my question, too.
- What kind of difference does Spark support, especially in ORC? Apache Spark only supports HiveFileFormat so far, not old OrcFileFormat.
- In addition, there is no Schema Merging. Randomly (usually the bigging ORC file?), the first correct ORC file schema is used now. For old ORC cases, those are meaningless like `_colX`. For me, HiveMetastore schema is the only valid one in Apache Spark.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19470
thanks, merging to master/2.2!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19470
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457977
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+ withTempDatabase { db =>
+ client.runSqlHive(
+ s"""
+ |CREATE TABLE $db.t(
+ | click_id string,
+ | search_id string,
+ | uid bigint)
+ |PARTITIONED BY (
+ | ts string,
+ | hour string)
+ |STORED AS ORC
+ """.stripMargin)
+
+ client.runSqlHive(
+ s"""
+ |INSERT INTO TABLE $db.t
+ |PARTITION (ts = '98765', hour = '01')
+ |VALUES (12, 2, 12345)
+ """.stripMargin
+ )
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
+ Row("12", "2", 12345, "98765", "01"))
+
+ client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+ checkAnswer(
+ sql(s"SELECT click_id, search_id FROM $db.t"),
+ Row("12", "2"))
+
+ checkAnswer(
+ sql(s"SELECT search_id, click_id FROM $db.t"),
+ Row("2", "12"))
+
+ checkAnswer(
+ sql(s"SELECT search_id FROM $db.t"),
+ Row("2"))
+
+ checkAnswer(
+ sql(s"SELECT dummy, click_id FROM $db.t"),
+ Row(null, "12"))
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
--- End diff --
ditto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-18355][SQL] Use Spark schema to read ORC table in...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82620/testReport)** for PR 19470 at commit [`d11ce09`](https://github.com/apache/spark/commit/d11ce09d07f36f39c4a8a92c4f0ff0f19a35a724).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144469314
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,64 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ Seq("orc", "parquet").foreach { format =>
+ test(s"SPARK-18355 Read data from a hive table with a new column - $format") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(
+ HiveUtils.CONVERT_METASTORE_ORC.key -> value,
+ HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
--- End diff --
As you separate orc and parquet to two test in fact, maybe you just need to test against one config at one time, i.e., orc -> HiveUtils.CONVERT_METASTORE_ORC, parquet -> HiveUtils.CONVERT_METASTORE_PARQUET.key.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82718/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457932
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+ withTempDatabase { db =>
+ client.runSqlHive(
+ s"""
+ |CREATE TABLE $db.t(
+ | click_id string,
+ | search_id string,
+ | uid bigint)
+ |PARTITIONED BY (
+ | ts string,
+ | hour string)
+ |STORED AS ORC
+ """.stripMargin)
+
+ client.runSqlHive(
+ s"""
+ |INSERT INTO TABLE $db.t
+ |PARTITION (ts = '98765', hour = '01')
+ |VALUES (12, 2, 12345)
+ """.stripMargin
+ )
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
--- End diff --
please list all columns here instead of `*`, to make the test more clear
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19470
I remember we previously hit multiple issues due to the schema difference between the actual orc-file schema and the metastore schema. Just ensure it still exists.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Thank you so much, @cloud-fan , @gatorsmile , and @viirya !
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82724/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457796
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
def unwrapOrcStructs(
conf: Configuration,
dataSchema: StructType,
+ requiredSchema: StructType,
maybeStructOI: Option[StructObjectInspector],
iterator: Iterator[Writable]): Iterator[InternalRow] = {
val deserializer = new OrcSerde
- val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
- val unsafeProjection = UnsafeProjection.create(dataSchema)
+ val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+ val unsafeProjection = UnsafeProjection.create(requiredSchema)
def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
- val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
- case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+ val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+ case (field, ordinal) =>
+ var ref = oi.getStructFieldRef(field.name)
+ if (ref == null) {
+ val maybeIndex = dataSchema.getFieldIndex(field.name)
+ if (maybeIndex.isDefined) {
+ ref = oi.getStructFieldRef("_col" + maybeIndex.get)
+ }
+ }
+ ref -> ordinal
}.unzip
- val unwrappers = fieldRefs.map(unwrapperFor)
+ val unwrappers = fieldRefs.map(r => if (r == null) null else unwrapperFor(r))
iterator.map { value =>
val raw = deserializer.deserialize(value)
var i = 0
val length = fieldRefs.length
while (i < length) {
- val fieldValue = oi.getStructFieldData(raw, fieldRefs(i))
+ val fieldRef = fieldRefs(i)
+ val fieldValue = if (fieldRef == null) null else oi.getStructFieldData(raw, fieldRefs(i))
if (fieldValue == null) {
--- End diff --
nit:
```
if (fieldRef == null) {
row.setNull...
} else {
val fieldValue = ...
...
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
BTW, @cloud-fan . Could you review #18460 , too? I think we need your final approval. :)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82724/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19470
LGTM except some minor comments
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
@cloud-fan . Thank you so much for review!
I updated the PR except one: If`fieldValue` is `null`, we also use `setNull` again in `else`. So, the current one is simpler.
```scala
if (fieldRef == null) {
row.setNull...
} else {
val fieldValue = ...
...
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-18355][SQL] Use Spark schema to read ORC table in...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82620/testReport)** for PR 19470 at commit [`d11ce09`](https://github.com/apache/spark/commit/d11ce09d07f36f39c4a8a92c4f0ff0f19a35a724).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82720/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457357
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
if (maybePhysicalSchema.isEmpty) {
Iterator.empty
} else {
- val physicalSchema = maybePhysicalSchema.get
- OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+ OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --
does it work? seems here we lie to the orc reader about the physical schema.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82701/testReport)** for PR 19470 at commit [`8ac1acf`](https://github.com/apache/spark/commit/8ac1acfad74eba27a7123e04ca14682bff59a20b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-18355][SQL] Use Spark schema to read ORC table in...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144458108
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+ withTempDatabase { db =>
+ client.runSqlHive(
+ s"""
+ |CREATE TABLE $db.t(
+ | click_id string,
+ | search_id string,
+ | uid bigint)
+ |PARTITIONED BY (
+ | ts string,
+ | hour string)
+ |STORED AS ORC
+ """.stripMargin)
+
+ client.runSqlHive(
+ s"""
+ |INSERT INTO TABLE $db.t
+ |PARTITION (ts = '98765', hour = '01')
+ |VALUES (12, 2, 12345)
+ """.stripMargin
+ )
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
+ Row("12", "2", 12345, "98765", "01"))
+
+ client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+ checkAnswer(
+ sql(s"SELECT click_id, search_id FROM $db.t"),
+ Row("12", "2"))
+
+ checkAnswer(
+ sql(s"SELECT search_id, click_id FROM $db.t"),
+ Row("2", "12"))
+
+ checkAnswer(
+ sql(s"SELECT search_id FROM $db.t"),
+ Row("2"))
+
+ checkAnswer(
+ sql(s"SELECT dummy, click_id FROM $db.t"),
+ Row(null, "12"))
+
+ checkAnswer(
+ sql(s"SELECT * FROM $db.t"),
+ Row("12", "2", 12345, null, "98765", "01"))
+ }
+ }
+ }
+ }
+
+ // This test case is added to prevent regression.
+ test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
--- End diff --
it's weird to have a test verifying a bug, I think it's good enough to have a JIRA tracking this bug.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19470
LGTM, pending jenkins
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
To be clear, I'll file another JIRA about ORC status on mismatched column orders.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19470
Could you create test cases with the different schemas between files and hive metastore.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19470
One minor comment doesn't affect this. LGTM.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82724/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82720/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82718/testReport)** for PR 19470 at commit [`ef2123e`](https://github.com/apache/spark/commit/ef2123ecc516fce6feb2a4abe051b4ef862c51a0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Thank you for review, @gatorsmile .
Sure. I assume that you want to check the regression here. Could you tell me the degree of difference?
Here, this PR is focusing on missing-columns scenario after `ADD COLUMNS`. This is a typical use cases by customers.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82720/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19470
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144465324
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --
Improve the test case for checking the other formats?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19470
**[Test build #82718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82718/testReport)** for PR 19470 at commit [`ef2123e`](https://github.com/apache/spark/commit/ef2123ecc516fce6feb2a4abe051b4ef862c51a0).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457479
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
def unwrapOrcStructs(
conf: Configuration,
dataSchema: StructType,
+ requiredSchema: StructType,
maybeStructOI: Option[StructObjectInspector],
iterator: Iterator[Writable]): Iterator[InternalRow] = {
val deserializer = new OrcSerde
- val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
- val unsafeProjection = UnsafeProjection.create(dataSchema)
+ val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+ val unsafeProjection = UnsafeProjection.create(requiredSchema)
def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
- val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
- case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+ val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+ case (field, ordinal) =>
+ var ref = oi.getStructFieldRef(field.name)
+ if (ref == null) {
+ val maybeIndex = dataSchema.getFieldIndex(field.name)
--- End diff --
the `requiredSchema` is guaranteed to be contained in the `dataSchema`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144457643
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
if (maybePhysicalSchema.isEmpty) {
Iterator.empty
} else {
- val physicalSchema = maybePhysicalSchema.get
- OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+ OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --
oh i see, we only need to pass the required column indices to orc reader.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19470#discussion_r144472506
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,64 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}
+
+ Seq("orc", "parquet").foreach { format =>
+ test(s"SPARK-18355 Read data from a hive table with a new column - $format") {
+ val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+ Seq("true", "false").foreach { value =>
+ withSQLConf(
+ HiveUtils.CONVERT_METASTORE_ORC.key -> value,
+ HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
--- End diff --
Thank you for review, @viirya . For that, yes, we can, but that will be a little-bit overkill.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
Now, it's passed again. :)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19470
LGTM too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/19470
R failure seems to be irrelevant.
```
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/R/run-tests.sh ; process was terminated by signal 9
Attempting to post to Github...
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org