Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/07 01:55:40 UTC

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31075: [SPARK-34036][DOCS] Update ORC data source documentation

dongjoon-hyun commented on a change in pull request #31075:
URL: https://github.com/apache/spark/pull/31075#discussion_r553065320



##########
File path: docs/sql-data-sources-orc.md
##########
@@ -19,12 +19,93 @@ license: |
   limitations under the License.
 ---
 
-Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
-To do that, the following configurations are newly added. The vectorized reader is used for the
-native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
-is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
-serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
-the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
+* Table of contents
+{:toc}
+
+[Apache ORC](https://orc.apache.org) is a columnar format which has more advanced features such as bloom filters and columnar encryption.
+
+### ORC Implementation
+
+Spark supports two ORC implementations (`native` and `hive`) which are controlled by `spark.sql.orc.impl`.
+The two implementations share most functionality but have different design goals.
+- The `native` implementation is designed to follow Spark's data source behavior like `Parquet`.
+- The `hive` implementation is designed to follow Hive's behavior and uses Hive SerDe.
+
+For example, historically, the `native` implementation handled `CHAR/VARCHAR` with Spark's native `String` type while the `hive` implementation handled it via Hive `CHAR/VARCHAR`, so the query results were different. Since Spark 3.1.0, [SPARK-33480](https://issues.apache.org/jira/browse/SPARK-33480) removes this difference by supporting `CHAR/VARCHAR` from the Spark side.
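+As an illustrative sketch (the table name `users` is hypothetical), the implementation can be selected per session with a `SET` command:
+
+```sql
+-- choose the ORC implementation for the current session
+SET spark.sql.orc.impl=native;  -- or 'hive'
+-- tables created with `USING ORC` are then read with the selected implementation
+CREATE TABLE users (name STRING, age INT) USING ORC;
+```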
+
+### Vectorized Reader
+
+The `native` implementation supports a vectorized ORC reader and has been the default ORC implementation since Spark 2.3.
+The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`.
+For the Hive ORC serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
+the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`; this configuration is turned on by default.
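+As a minimal configuration sketch (these settings spell out what the text above describes; the first two match the stated defaults):
+
+```sql
+-- enable the vectorized reader for native ORC tables
+SET spark.sql.orc.impl=native;
+SET spark.sql.orc.enableVectorizedReader=true;
+-- additionally required for Hive ORC serde tables
+SET spark.sql.hive.convertMetastoreOrc=true;
+```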
+
+### Schema Merging
+
+Like Protocol Buffer, Avro, and Thrift, ORC also supports schema evolution. Users can start with
+a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
+up with multiple ORC files with different but mutually compatible schemas. The ORC data

Review comment:
       Thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org