Posted to issues@spark.apache.org by "Zhan Zhang (JIRA)" <ji...@apache.org> on 2014/10/16 04:47:33 UTC

[jira] [Updated] (SPARK-3720) support ORC in spark sql

     [ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhan Zhang updated SPARK-3720:
------------------------------
    Attachment: orc.diff

This is the diff for ORC file support. I have also been working on this item in parallel under SPARK-2883. Since a PR is already open for this JIRA, I am attaching my diff here and will work with wangfei to consolidate our work on this feature. A detailed description follows.

1. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an ORC file; the latter imports an ORC file into a Spark SQL table.
2. Column pruning.
3. Self-contained schema support: the ORC support is fully functional without a Hive metastore; the table schema is maintained by the ORC file itself.
4. To enable ORC support, users need to import org.apache.spark.sql.hive.orc._ to bring it into scope.
5. The ORC support lives in HiveContext purely for packaging reasons, because we don't want to bring the Hive dependency into core Spark SQL. Note that ORC operations do not rely on the Hive metastore.
6. It supports the full range of complex data types in Spark SQL, for example lists, sequences, and nested types.
7. Because the feature lives in HiveContext, the SQL parser used is the Hive parser.
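As a sketch of point 6 (complex types), the following shows a nested schema round-tripped through ORC. This assumes the saveAsOrcFile/orcFile API from the attached patch and a running SparkContext `sc`; the schema and file name are illustrative only.

```scala
// Sketch only: assumes the attached patch's saveAsOrcFile / orcFile API.
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// A nested schema: a person with a list of phone numbers and an address struct.
val nestedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("phones", ArrayType(StringType), nullable = true),
  StructField("address", StructType(Seq(
    StructField("city", StringType, nullable = true),
    StructField("zip", StringType, nullable = true))), nullable = true)))

val rows = sc.parallelize(Seq(
  Row("alice", Seq("555-0100", "555-0101"), Row("springfield", "12345"))))

// Round-trip through ORC: the schema travels with the file, no metastore needed.
ctx.applySchema(rows, nestedSchema).saveAsOrcFile("people_nested.orc")
ctx.orcFile("people_nested.orc").registerTempTable("people_nested")
ctx.sql("SELECT address.city FROM people_nested").collect()
```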

Hive 0.13.1 support (with minor changes, once Spark's Hive dependency is upgraded to 0.13.1):
1. ORC files can use different compression codecs, e.g., SNAPPY, LZO, ZLIB, and NONE.
2. Predicate pushdown.
The following example shows how to use ORC files; from the user's perspective it is almost identical to the Parquet support.
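For point 1, a sketch of how the compression codec could be selected. This assumes the codec is picked up from the Hadoop configuration via ORC's standard "orc.compress" property (SNAPPY | LZO | ZLIB | NONE); the patch may expose a different knob, and `peopleSchemaRDD` is the table built in the example below.

```scala
// Sketch only: assumes ORC's standard "orc.compress" Hadoop property is honored.
val ctx = new org.apache.spark.sql.hive.HiveContext(sc)
ctx.sparkContext.hadoopConfiguration.set("orc.compress", "SNAPPY")

// Subsequent writes would then use SNAPPY compression.
peopleSchemaRDD.saveAsOrcFile("people_snappy.orc")
```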
// Bring ORC support (saveAsOrcFile / orcFile) into scope.
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Build a SchemaRDD from a plain text file of "name,age" lines.
val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Write the table out as an ORC file, then read it back and query it.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")
val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

> support ORC in spark sql
> ------------------------
>
>                 Key: SPARK-3720
>                 URL: https://issues.apache.org/jira/browse/SPARK-3720
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: wangfei
>         Attachments: orc.diff
>
>
> The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:
> 1. a single file as the output of each task, which reduces the NameNode's load
> 2. Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union)
> 3. light-weight indexes stored within the file:
>    - skip row groups that don't pass predicate filtering
>    - seek to a given row
> 4. block-mode compression based on data type:
>    - run-length encoding for integer columns
>    - dictionary encoding for string columns
> 5. concurrent reads of the same file using separate RecordReaders
> 6. ability to split files without scanning for markers
> 7. a bound on the amount of memory needed for reading or writing
> 8. metadata stored using Protocol Buffers, which allows addition and removal of fields
> Spark SQL already supports Parquet; supporting ORC would give people more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org