Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/08/16 14:41:11 UTC

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/22121

    [SPARK-25133][SQL][Doc]AVRO data source guide

    ## What changes were proposed in this pull request?
    
    Create documentation for AVRO data source.
    The new page will be linked in https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html .
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark avroDoc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22121
    
----
commit 3d8220f1d9145fb6606bc16bf62cc92c2aaddb35
Author: Gengliang Wang <ge...@...>
Date:   2018-08-16T14:18:22Z

    add avro-data-source-guide.md

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2439/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95105/
    Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by dhruve <gi...@git.apache.org>.

Github user dhruve commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r212032223
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark application, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since the `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro` (or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides the function `to_avro()` to encode a struct as binary data in Avro format and `from_avro()` to decode Avro binary data back into a struct.
    +
    +Using Avro records as columns is useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import java.nio.file.{Files, Paths}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import org.apache.spark.sql.avro.*;
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load();
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +Dataset<Row> output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro(col("user.name")).as("value"));
    +
    +StreamingQuery ds = output
    +  .writeStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start();
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Option
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by a user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read.<br> If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option specifies the compression codec used when writing.<br>
    +  Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>.<br> If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> is used.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Configuration
    +Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td>spark.sql.legacy.replaceDatabricksSparkAvro.enabled</td>
    +    <td>true</td>
    +    <td>If it is set to true, the data source provider <code>com.databricks.spark.avro</code> is mapped to the built-in but external Avro data source module for backward compatibility.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.compression.codec</td>
    +    <td>snappy</td>
    +    <td>Compression codec used in writing of AVRO files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. Default codec is snappy.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.deflate.level</td>
    +    <td>-1</td>
    +    <td>Compression level for the deflate codec used when writing Avro files. Valid values are 1 to 9 inclusive, or -1. The default value is -1, which corresponds to level 6 in the current implementation.</td>
    +  </tr>
    +</table>
    +
    +## Compatibility with Databricks spark-avro
    +This Avro data source module is originally from and compatible with Databricks's open source repository 
    +[spark-avro](https://github.com/databricks/spark-avro).
    +
    +By default with the SQL configuration `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` enabled, the data source provider `com.databricks.spark.avro` is 
    +mapped to this built-in Avro module. For the Spark tables created with `Provider` property as `com.databricks.spark.avro` in 
    +catalog meta store, the mapping is essential to load these tables if you are using this built-in Avro module. 
    +
    +Note in Databricks's [spark-avro](https://github.com/databricks/spark-avro), implicit classes 
    +`AvroDataFrameWriter` and `AvroDataFrameReader` were created for shortcut function `.avro()`. In this 
    +built-in but external module, both implicit classes are removed. Please use `.format("avro")` in 
    +`DataFrameWriter` or `DataFrameReader` instead, which should be clean and good enough.
    +
    +If you prefer using your own build of the `spark-avro` jar file, you can simply disable the configuration 
    +`spark.sql.legacy.replaceDatabricksSparkAvro.enabled` and use the `--jars` option when deploying your 
    +applications. Read the 
    +[Advanced Dependency Management](https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management) 
    +section in the Application Submission Guide for more details.
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    --- End diff --
    
    Hey. I know that we didn't support reading primitive types in the databricks-avro package, so I tried to read a primitive Avro file and wasn't able to do so on current master.
    
    How I tried reading it: `spark.read.format("avro").load("avroPrimitiveTypes/randomBoolean.avro")`
    
    I think we could reword this to be explicit that we support reading primitive types under records, unless I am missing something here.
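
To make the distinction in this comment concrete, here is a minimal Python sketch (standard library only, no Spark involved) that contrasts a bare primitive Avro schema with the same primitive nested under a record. The helper name `wrap_in_record` and the field name `value` are hypothetical and chosen only for illustration; `topLevelRecord` is borrowed from the `recordName` default in the quoted guide. The sketch only builds schema JSON; it does not show what Spark can or cannot read.

```python
import json

# Avro primitive type names, per the Avro specification.
AVRO_PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def wrap_in_record(primitive_type, field_name="value", record_name="topLevelRecord"):
    """Return a JSON Avro record schema with a single field of the given primitive type.

    `field_name` and `record_name` are illustrative defaults, not Spark requirements.
    """
    if primitive_type not in AVRO_PRIMITIVES:
        raise ValueError(f"not an Avro primitive type: {primitive_type}")
    schema = {
        "type": "record",
        "name": record_name,
        "fields": [{"name": field_name, "type": primitive_type}],
    }
    return json.dumps(schema)


# A bare primitive schema: the kind of top-level schema the comment reports as unreadable.
bare_schema = json.dumps("boolean")

# The same primitive nested under a record: the case the comment suggests documenting.
record_schema = wrap_in_record("boolean")

print(bare_schema)
print(record_schema)
```

A schema string like `record_schema` is the JSON form that the guide's `avroSchema` option and `from_avro()` expect.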
    



---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211988526
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark application, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since the `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro` (or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides the function `to_avro()` to encode a struct as binary data in Avro format and `from_avro()` to decode Avro binary data back into a struct.
    +
    +Using Avro records as columns is useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import java.nio.file.{Files, Paths}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import org.apache.spark.sql.avro.*;
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load();
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +Dataset<Row> output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro(col("user.name")).as("value"));
    +
    +StreamingQuery ds = output
    +  .writeStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start();
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Option
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by a user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read.<br> If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option specifies the compression codec used when writing.<br>
    +  Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>.<br> If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> is used.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Configuration
    +Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td>spark.sql.legacy.replaceDatabricksSparkAvro.enabled</td>
    +    <td>true</td>
    +    <td>If it is set to true, the data source provider <code>com.databricks.spark.avro</code> is mapped to the built-in but external Avro data source module for backward compatibility.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.compression.codec</td>
    +    <td>snappy</td>
    +    <td>Compression codec used in writing of AVRO files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. Default codec is snappy.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.deflate.level</td>
    +    <td>-1</td>
    +    <td>Compression level for the deflate codec used when writing Avro files. Valid values are 1 to 9 inclusive, or -1. The default value is -1, which corresponds to level 6 in the current implementation.</td>
    +  </tr>
    +</table>
    +
    +## Compatibility with Databricks spark-avro
    +This Avro data source module is originally from and compatible with Databricks's open source repository 
    +[spark-avro](https://github.com/databricks/spark-avro).
    +
    +By default with the SQL configuration `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` enabled, the data source provider `com.databricks.spark.avro` is 
    +mapped to this built-in Avro module. For the Spark tables created with `Provider` property as `com.databricks.spark.avro` in 
    +catalog meta store, the mapping is essential to load these tables if you are using this built-in Avro module. 
    +
    +Note in Databricks's [spark-avro](https://github.com/databricks/spark-avro), implicit classes 
    +`AvroDataFrameWriter` and `AvroDataFrameReader` were created for shortcut function `.avro()`. In this 
    +built-in but external module, both implicit classes are removed. Please use `.format("avro")` in 
    +`DataFrameWriter` or `DataFrameReader` instead, which should be clean and good enough.
    +
    +If you prefer using your own build of the `spark-avro` jar file, you can simply disable the configuration 
    +`spark.sql.legacy.replaceDatabricksSparkAvro.enabled` and use the `--jars` option when deploying your 
    +applications. Read the 
    +[Advanced Dependency Management](https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management) 
    +section in the Application Submission Guide for more details.
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    +<table class="table">
    +  <tr><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>boolean</td>
    +    <td>BooleanType</td>
    +  </tr>
    +  <tr>
    +    <td>int</td>
    +    <td>IntegerType</td>
    +  </tr>
    +  <tr>
    +    <td>long</td>
    +    <td>LongType</td>
    +  </tr>
    +  <tr>
    +    <td>float</td>
    +    <td>FloatType</td>
    +  </tr>
    +  <tr>
    +    <td>double</td>
    +    <td>DoubleType</td>
    +  </tr>
    +  <tr>
    +    <td>string</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>enum</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>fixed</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>bytes</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>record</td>
    +    <td>StructType</td>
    +  </tr>
    +  <tr>
    +    <td>array</td>
    +    <td>ArrayType</td>
    +  </tr>
    +  <tr>
    +    <td>map</td>
    +    <td>MapType</td>
    +  </tr>
    +  <tr>
    +    <td>union</td>
    +    <td>See below</td>
    +  </tr>
    +</table>
    +
    +In addition to the types listed above, it supports reading `union` types. The following three types are considered basic `union` types:
    +
    +1. `union(int, long)` will be mapped to LongType.
    +2. `union(float, double)` will be mapped to DoubleType.
    +3. `union(something, null)`, where something is any supported Avro type. This will be mapped to the same Spark SQL type as that of something, with nullable set to true.
    +
    +All other union types are considered complex. They will be mapped to StructType where field names are member0, member1, etc., in accordance with members of the union. This is consistent with the behavior when converting between Avro and Parquet.
    +
    +It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types):
    +
    +<table class="table">
    +  <tr><th><b>Avro logical type</b></th><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>date</td>
    +    <td>int</td>
    +    <td>DateType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-millis</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-micros</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>fixed</td>
    +    <td>DecimalType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    +</table>
    +At the moment, it ignores docs, aliases and other properties present in the Avro file.
    +
    +## Supported types for Spark SQL -> Avro conversion
    +Spark supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below:
    +
    +<table class="table">
    +<tr><th><b>Spark SQL type</b></th><th><b>Avro type</b></th><th><b>Avro logical type</b></th></tr>
    +  <tr>
    +    <td>ByteType</td>
    +    <td>int</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>ShortType</td>
    +    <td>int</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>BinaryType</td>
    +    <td>bytes</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>Date</td>
    --- End diff --
    
    DateType
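
The union mapping rules in the quoted guide (three basic cases, with every other union becoming a struct whose fields are named `member0`, `member1`, and so on) can be sketched in plain Python. This is an illustration of the rules as the guide words them, not Spark's implementation: the function name `map_union` and the string return format are hypothetical, only primitive member types are handled, and unions that combine `null` with more than one other type are rejected because the quoted text does not say how they are treated.

```python
# Avro primitive -> Spark SQL type names, copied from the table in the quoted guide.
PRIMITIVE_TO_SPARK = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "string": "StringType",
    "bytes": "BinaryType",
}


def map_union(members):
    """Describe the Spark SQL type for a union of Avro primitive type names.

    Returns a human-readable string; this is a sketch of the documented rules only.
    """
    kinds = set(members)
    # Rule 1: union(int, long) -> LongType
    if kinds == {"int", "long"}:
        return "LongType"
    # Rule 2: union(float, double) -> DoubleType
    if kinds == {"float", "double"}:
        return "DoubleType"
    # Rule 3: union(something, null) -> type of something, nullable
    if "null" in kinds:
        others = [m for m in members if m != "null"]
        if len(others) == 1:
            return f"{PRIMITIVE_TO_SPARK[others[0]]} (nullable)"
        # Not specified by the quoted text, so this sketch refuses to guess.
        raise ValueError("null combined with zero or several types is not covered by this sketch")
    # Everything else: a struct with one field per union member.
    fields = ", ".join(
        f"member{i}: {PRIMITIVE_TO_SPARK[m]}" for i, m in enumerate(members)
    )
    return f"StructType({fields})"


print(map_union(["int", "long"]))
print(map_union(["string", "null"]))
print(map_union(["int", "string"]))
```

Returning strings keeps the sketch free of any Spark dependency; a real conversion would build `DataType` objects instead.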


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)** for PR 22121 at commit [`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95100 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95100/testReport)** for PR 22121 at commit [`d9c5352`](https://github.com/apache/spark/commit/d9c5352c8ffc70d271a8aa68c3ffec41b4158ece).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2473/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94850/
    Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210922590
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in 
    +<code>DataFrameReader</code> or <code>DataFrameWriter</code>.
    +To load/save data in Avro format, you need to specify the data source option <code>format</code> as short name <code>avro</code> or full name <code>org.apache.spark.sql.avro</code>.
    --- End diff --
    
    You can use back-ticks rather than `<code>` for simpler code formatting. No big deal either way.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r212043748
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    --- End diff --
    
    I think it should be OK to omit `StandardCharsets.UTF_8` here.
    The example code should stay simple, since it is only for demonstration.
    The key part here is `to_avro` and `from_avro`.
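
For readers who do want the explicit charset discussed in this comment, the following is a minimal sketch, not part of the PR. The class name `SchemaFileReader`, the helper `readUtf8`, and the temporary file are hypothetical names introduced for illustration. It relies on the fact that `new String(byte[])` without a charset decodes with the JVM's default charset, whereas passing `StandardCharsets.UTF_8` makes the decoding independent of the platform setting.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SchemaFileReader {
    // Hypothetical helper: reads the whole file and decodes it as UTF-8,
    // regardless of the JVM's default charset.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a schema file such as user.avsc.
        Path tmp = Files.createTempFile("user", ".avsc");
        try {
            Files.write(tmp, "{\"type\": \"string\"}".getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

For an ASCII-only schema file the two variants behave the same on common platforms, which is consistent with the comment's point that the shorter form is acceptable in a demonstration.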


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95140/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @srowen @tgravescs @gatorsmile @HyukjinKwon @dongjoon-hyun Thanks for the reviews! I have added the sections `to_avro() and from_avro()` and `Compatibility with Databricks spark-avro`.
    
    I have also attached an HTML file for preview; please check it in the PR description.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211986779
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +DataFrame output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro(col("user.name")).as("value"))
    +
    +StreamingQuery ds = output
    +  .writeStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Option
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    --- End diff --
    
    We should mention the behavior when the specified schema doesn't match the real schema.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210973663
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in 
    --- End diff --
    
    I see. I can change the title to read/write Avro data...


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22121


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94850/testReport)** for PR 22121 at commit [`3d8220f`](https://github.com/apache/spark/commit/3d8220f1d9145fb6606bc16bf62cc92c2aaddb35).


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211011718
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read. If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option allows to specify a compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> config is taken into account.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    +<table class="table">
    +  <tr><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>boolean</td>
    +    <td>BooleanType</td>
    +  </tr>
    +  <tr>
    +    <td>int</td>
    +    <td>IntegerType</td>
    +  </tr>
    +  <tr>
    +    <td>long</td>
    +    <td>LongType</td>
    +  </tr>
    +  <tr>
    +    <td>float</td>
    +    <td>FloatType</td>
    +  </tr>
    +  <tr>
    +    <td>double</td>
    +    <td>DoubleType</td>
    +  </tr>
    +  <tr>
    +    <td>string</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>enum</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>fixed</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>bytes</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>record</td>
    +    <td>StructType</td>
    +  </tr>
    +  <tr>
    +    <td>array</td>
    +    <td>ArrayType</td>
    +  </tr>
    +  <tr>
    +    <td>map</td>
    +    <td>MapType</td>
    +  </tr>
    +  <tr>
    +    <td>union</td>
    +    <td>See below</td>
    +  </tr>
    +</table>
    +
    +In addition to the types listed above, it supports reading `union` types. The following three types are considered basic `union` types:
    +
    +1. `union(int, long)` will be mapped to LongType.
    +2. `union(float, double)` will be mapped to DoubleType.
    +3. `union(something, null)`, where something is any supported Avro type. This will be mapped to the same Spark SQL type as that of something, with nullable set to true.
    +All other union types are considered complex. They will be mapped to StructType where field names are member0, member1, etc., in accordance with members of the union. This is consistent with the behavior when converting between Avro and Parquet.
    +
    +It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types):
    +
    +<table class="table">
    +  <tr><th><b>Avro logical type</b></th><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>date</td>
    +    <td>int</td>
    +    <td>DateType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-millis</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-micros</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    +</table>
    +At the moment, it ignores docs, aliases and other properties present in the Avro file.
    +
    +## Supported types for Spark SQL -> Avro conversion
    +Spark supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below:
    +
    +<table class="table">
    +<tr><th><b>Spark SQL type</b></th><th><b>Avro type</b></th><th><b>Avro logical type</b></th></tr>
    +  <tr>
    +    <td>ByteType</td>
    +    <td>int</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>ShortType</td>
    +    <td>int</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>BinaryType</td>
    +    <td>bytes</td>
    +    <td></td>
    +  </tr>
    +  <tr>
    +    <td>Date</td>
    +    <td>int</td>
    +    <td>date</td>
    +  </tr>
    +  <tr>
    +    <td>TimestampType</td>
    +    <td>long</td>
    +    <td>timestamp-micros</td>
    +  </tr>
    +  <tr>
    +    <td>DecimalType</td>
    +    <td>fixed</td>
    +    <td>decimal</td>
    +  </tr>
    +</table>
    +
    +You can also specify the whole output Avro schema with the option `avroSchema`, so that Spark SQL types can be converted into other Avro types. The following conversions is not by default and require user specified Avro schema:
    --- End diff --
    
    `is not` -> `are not applied`?
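
The union rules quoted in the diff above (`union(int, long)`, `union(float, double)`, `union(something, null)`, and the `member0`, `member1`, ... fallback) can be sketched as a small pure function. This is an illustration of the documented rules only, not Spark's implementation. The class `UnionMappingSketch`, its methods, and the string output format are hypothetical, and the choice to strip `null` before applying the other rules is an assumption that the quoted text does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnionMappingSketch {
    // Primitive mappings taken from the table in the quoted guide.
    static String primitive(String avroType) {
        switch (avroType) {
            case "boolean": return "BooleanType";
            case "int":     return "IntegerType";
            case "long":    return "LongType";
            case "float":   return "FloatType";
            case "double":  return "DoubleType";
            case "string":  return "StringType";
            case "bytes":   return "BinaryType";
            default:
                throw new IllegalArgumentException("not covered by this sketch: " + avroType);
        }
    }

    // Returns a textual description of the Spark SQL type for a union.
    static String mapUnion(List<String> members) {
        // Assumption: a null member only makes the result nullable.
        List<String> nonNull = new ArrayList<>();
        for (String m : members) {
            if (!m.equals("null")) {
                nonNull.add(m);
            }
        }
        if (nonNull.isEmpty()) {
            throw new IllegalArgumentException("union needs a non-null member");
        }
        String suffix = nonNull.size() < members.size() ? " (nullable)" : "";
        Set<String> kinds = new HashSet<>(nonNull);

        if (nonNull.size() == 1) {
            // Rule 3: union(something, null) maps to the type of something.
            return primitive(nonNull.get(0)) + suffix;
        }
        if (nonNull.size() == 2 && kinds.equals(new HashSet<>(Arrays.asList("int", "long")))) {
            return "LongType" + suffix;   // Rule 1
        }
        if (nonNull.size() == 2 && kinds.equals(new HashSet<>(Arrays.asList("float", "double")))) {
            return "DoubleType" + suffix; // Rule 2
        }
        // All other unions: a struct with fields member0, member1, ...
        StringBuilder sb = new StringBuilder("StructType(");
        for (int i = 0; i < nonNull.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("member").append(i).append(": ").append(primitive(nonNull.get(i)));
        }
        return sb.append(")").append(suffix).toString();
    }

    public static void main(String[] args) {
        System.out.println(mapUnion(Arrays.asList("int", "long")));
        System.out.println(mapUnion(Arrays.asList("string", "null")));
        System.out.println(mapUnion(Arrays.asList("int", "string")));
    }
}
```

Only the primitive types from the quoted table are handled; nested records, arrays, and maps are left out to keep the sketch short.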


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95105 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95105/testReport)** for PR 22121 at commit [`8da8250`](https://github.com/apache/spark/commit/8da82506e06e36d63bf91fdda194a866f2d977ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95116/testReport)** for PR 22121 at commit [`581b7e6`](https://github.com/apache/spark/commit/581b7e60e70deac79a15e0a903a78deb10d4f7ac).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2455/
    Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r210981586

--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,267 @@
+---
+layout: global
+title: Avro Data Source Guide
+---
+
+Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
+
+## Deploying
+The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
--- End diff --

OK. When I see a Deploying section, I expect it to tell me what my options are, so perhaps just rephrase it to indicate that `--packages` is one way to do it.

It would be nice to at least have a general statement saying that external modules aren't included with Spark by default and that users must include the necessary jars themselves. How to do this is deployment specific; one way is the `--packages` option.

Note I think the structured-streaming-kafka section should ideally be updated to something similar as well. And really any external module for that matter. It would be nice to tell users how they can include these without assuming they just know how to.
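
As a rough illustration of "more than one way to do it", the sketch below assembles two `spark-submit` command lines in plain Python: one resolving the module through `--packages`, one shipping a locally available jar through `--jars`. The helper names, the application file, the jar path and the version numbers are all hypothetical; only the two flags and the `org.apache.spark:spark-avro_<scala>:<spark>` coordinate shape come from the thread and the Spark launcher.

```python
def avro_coordinate(scala_binary_version: str, spark_version: str) -> str:
    """Maven coordinate of the spark-avro module (versions supplied by the caller)."""
    return f"org.apache.spark:spark-avro_{scala_binary_version}:{spark_version}"


def submit_command(app: str, packages=None, jars=None) -> str:
    """Build a spark-submit command line; both flags take comma-separated lists."""
    args = ["./bin/spark-submit"]
    if packages:
        args += ["--packages", ",".join(packages)]
    if jars:
        args += ["--jars", ",".join(jars)]
    args.append(app)
    return " ".join(args)


# Option 1: let spark-submit resolve the module and its dependencies.
via_packages = submit_command("app.py", packages=[avro_coordinate("2.11", "2.4.0")])

# Option 2: ship a jar that is already on local disk (path is made up).
via_jars = submit_command("app.py", jars=["/opt/jars/spark-avro_2.11-2.4.0.jar"])
```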

---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94886/testReport)** for PR 22121 at commit [`72c8ef2`](https://github.com/apache/spark/commit/72c8ef21d966ff2b2471a998323fd7b24278c12f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94859/testReport)** for PR 22121 at commit [`030ca0f`](https://github.com/apache/spark/commit/030ca0fc95369eab9435f40ec769c68da9b1682a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94859/testReport)** for PR 22121 at commit [`030ca0f`](https://github.com/apache/spark/commit/030ca0fc95369eab9435f40ec769c68da9b1682a).


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2271/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95100/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95099/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211127696
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read. If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option allows to specify a compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> config is taken into account.</td>
    --- End diff --
    
    For data source options, yes.
    For SQL configurations, I think the only one that matters is the one in https://github.com/apache/spark/pull/22133. I am trying to think of a better name for that configuration.
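
The relationship between the `compression` data source option and the `spark.sql.avro.compression.codec` configuration, as the quoted table row describes it, can be sketched in plain Python. This is not Spark's implementation: `resolve_codec` and the two dictionaries are hypothetical stand-ins for the writer options and the session configuration, and the `snappy` fallback is taken from the Default column above.

```python
SUPPORTED_CODECS = {"uncompressed", "snappy", "deflate", "bzip2", "xz"}
CONF_KEY = "spark.sql.avro.compression.codec"


def resolve_codec(write_options: dict, sql_conf: dict) -> str:
    """Pick the codec: the per-write option wins, then the SQL config, then snappy."""
    codec = write_options.get("compression")
    if codec is None:
        codec = sql_conf.get(CONF_KEY, "snappy")
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported Avro compression codec: {codec}")
    return codec


# The option is set on this write, so the session-level config is ignored.
from_option = resolve_codec({"compression": "deflate"}, {CONF_KEY: "xz"})

# No option on this write, so the session-level config applies.
from_conf = resolve_codec({}, {CONF_KEY: "xz"})
```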


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by arunmahadevan <gi...@git.apache.org>.

Github user arunmahadevan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r212031015
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    --- End diff --
    
    Does it need to be a struct, or can it be any Spark SQL type?
    Maybe: `to_avro` to encode Spark SQL types as Avro bytes and `from_avro` to decode Avro bytes back into Spark SQL types?


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2288/
    Test PASSed.


---

[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211128707
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read. If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option allows to specify a compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> config is taken into account.</td>
    --- End diff --
    
    I will add a section for the SQL configurations.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95113/testReport)** for PR 22121 at commit [`006ea40`](https://github.com/apache/spark/commit/006ea40ce0d7a3939241c6e0126732e9cebb59ca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2302/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94905/testReport)** for PR 22121 at commit [`ff6d3ab`](https://github.com/apache/spark/commit/ff6d3abf1b1f4dec6ec29266a325a2f7bd4fdd05).


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2440/
    Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211985059
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    --- End diff --
    
    `encode a struct as a string`: I think the result is not a "string" but "binary"?
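
To illustrate why "binary" is the more accurate word: in Avro's binary encoding a record is just the concatenation of its encoded fields, with no field names or delimiters, so the result is generally not readable text. Below is a minimal sketch in plain Python, assuming a hypothetical record with a string `name` and an int `favorite_number`. The helper names are made up, and this is not how Spark implements `to_avro`.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(name: str, favorite_number: int) -> bytes:
    """Record body: field encodings back to back, in schema order."""
    return encode_string(name) + zigzag_varint(favorite_number)


payload = encode_record("Ben", 7)
```

Here `payload` is a `bytes` value: a length prefix, the three UTF-8 bytes of the name, and one byte for the number. Decoding it requires the schema, which is why `from_avro` takes one.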


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @gatorsmile does this address your comment about documenting new features? 


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2246/
    Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95100 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95100/testReport)** for PR 22121 at commit [`d9c5352`](https://github.com/apache/spark/commit/d9c5352c8ffc70d271a8aa68c3ffec41b4158ece).


---

[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211773604
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    --- End diff --
    
    Note: I think we should add a compatibility section here; see https://github.com/apache/spark/pull/22133 for reference.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211986370
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +DataFrame output = df
    --- End diff --
    
    Are you sure this compiles in Java?


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @tgravescs @srowen @gatorsmile Thanks for reviewing. I will keep updating this.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211959406
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    --- End diff --
    
    there is no '.avro' API in


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94928/
    Test PASSed.


---


[GitHub] spark issue #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @srowen Hi Sean, I will add content for new features soon. I also updated the title. 
    Thanks.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @gengliangwang Could you also post the screenshot in your PR description?



---


[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211875231
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    --- End diff --
    
    @tgravescs I have added a separate section for it :)


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2254/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    We should do the same thing for the other native sources. 


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210917715
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    --- End diff --
    
    Should we also mention that you can include it with `--jars` if you build the jar yourself?


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    thanks, merging to master!


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211989166
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    --- End diff --
    
    Missing semicolon at the end of the line (this applies to all statements in the Java example).
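
As a rough illustration of the kind of fix being asked for, the schema-loading step of the Java tab might look like the sketch below once its imports, statement terminators, and exception handling are in place. The class name `AvroSchemaLoader` and the method `readSchema` are hypothetical, introduced only for this example and not part of Spark. The sketch uses only the JDK, so it does not cover the Spark calls in the rest of the snippet.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Spark: reads an Avro schema (.avsc) file
// into the JSON string form that `from_avro` requires.
final class AvroSchemaLoader {
  private AvroSchemaLoader() {}

  static String readSchema(String path) {
    try {
      // Files.readAllBytes throws a checked IOException, which Java (unlike
      // Scala) requires the caller to either handle or declare.
      return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    } catch (IOException e) {
      // Rethrow unchecked so call sites in a doc example stay short.
      throw new UncheckedIOException(e);
    }
  }
}
```

With a helper along these lines, the line from the diff could be written as `String jsonFormatSchema = AvroSchemaLoader.readSchema("./examples/src/main/resources/user.avsc");`, with the trailing semicolon.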


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210919750
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in 
    +<code>DataFrameReader</code> or <code>DataFrameWriter</code>.
    +To load/save data in Avro format, you need to specify the data source option <code>format</code> as short name <code>avro</code> or full name <code>org.apache.spark.sql.avro</code>.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Configuration
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    --- End diff --
    
    The configuration here has no `spark.` prefix? Is it set via the `.option` interface?
    I think we should clarify that for the user, since later in the table you have the `spark.` configs, which I assume aren't set via `.option` but via `--conf`.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211985668
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    --- End diff --
    
    Do not use `presently`; we should say `As of Spark 2.4, ...`


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211011168
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read. If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option allows to specify a compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> config is taken into account.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    +<table class="table">
    +  <tr><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>boolean</td>
    +    <td>BooleanType</td>
    +  </tr>
    +  <tr>
    +    <td>int</td>
    +    <td>IntegerType</td>
    +  </tr>
    +  <tr>
    +    <td>long</td>
    +    <td>LongType</td>
    +  </tr>
    +  <tr>
    +    <td>float</td>
    +    <td>FloatType</td>
    +  </tr>
    +  <tr>
    +    <td>double</td>
    +    <td>DoubleType</td>
    +  </tr>
    +  <tr>
    +    <td>string</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>enum</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>fixed</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>bytes</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>record</td>
    +    <td>StructType</td>
    +  </tr>
    +  <tr>
    +    <td>array</td>
    +    <td>ArrayType</td>
    +  </tr>
    +  <tr>
    +    <td>map</td>
    +    <td>MapType</td>
    +  </tr>
    +  <tr>
    +    <td>union</td>
    +    <td>See below</td>
    +  </tr>
    +</table>
    +
    +In addition to the types listed above, it supports reading `union` types. The following three types are considered basic `union` types:
    +
    +1. `union(int, long)` will be mapped to LongType.
    +2. `union(float, double)` will be mapped to DoubleType.
    +3. `union(something, null)`, where something is any supported Avro type. This will be mapped to the same Spark SQL type as that of something, with nullable set to true.
    +All other union types are considered complex. They will be mapped to StructType where field names are member0, member1, etc., in accordance with members of the union. This is consistent with the behavior when converting between Avro and Parquet.
    +
    +It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types):
    +
    +<table class="table">
    +  <tr><th><b>Avro logical type</b></th><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>date</td>
    +    <td>int</td>
    +    <td>DateType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-millis</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-micros</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    --- End diff --
    
    Could you remove the repeated `decimal` row (lines 191 ~ 195)?
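
For reference, the union rules quoted in the diff above can be sketched in plain Python. This is only an illustration, not the `spark-avro` implementation: the `map_union` helper and the `PRIMITIVES` table are hypothetical names, only primitive members are handled, and dropping a `null` branch before applying the other rules is an assumption that the quoted text does not spell out.

```python
# Avro primitive name -> Spark SQL type name, copied from the table in the quoted diff.
PRIMITIVES = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "string": "StringType",
    "bytes": "BinaryType",
}


def map_union(members):
    """Return (spark_type, nullable) for a list of Avro primitive type names."""
    # Assumption: a "null" branch only marks the result as nullable and is
    # dropped before the remaining rules are applied.
    non_null = [m for m in members if m != "null"]
    nullable = len(non_null) != len(members)
    if not non_null:
        # A union with no non-null member is not covered by the quoted text.
        raise ValueError("union has no non-null member")

    if len(non_null) == 1:
        # union(something, null): same type as `something`, nullable.
        return PRIMITIVES[non_null[0]], nullable
    if set(non_null) == {"int", "long"}:
        return "LongType", nullable
    if set(non_null) == {"float", "double"}:
        return "DoubleType", nullable

    # Every other union is "complex": a struct with fields member0, member1, ...
    fields = ", ".join(
        f"member{i}: {PRIMITIVES[m]}" for i, m in enumerate(non_null)
    )
    return f"StructType({fields})", nullable


print(map_union(["string", "null"]))
print(map_union(["int", "string"]))
```

Returning the type name and the nullability separately keeps the three "basic" cases and the struct fallback visible as distinct branches, which mirrors how the guide lists them.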


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94850/testReport)** for PR 22121 at commit [`3d8220f`](https://github.com/apache/spark/commit/3d8220f1d9145fb6606bc16bf62cc92c2aaddb35).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    The preview doc (zip file in the PR description) has been updated to the latest version.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210972278
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    --- End diff --
    
    Here I am following https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying .
    Using `--packages` ensures that this library and its dependencies are added to the classpath, which should be good enough for general users.
    Users who build their own jar are expected to know the general `--jars` option.
    I can add it if you insist.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211007298
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    --- End diff --
    
    Actually, the `--jars` option is well explained in https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management , and that doc URL is also mentioned in both Deploying sections.
    I still feel that a short introduction to the `--jars` option is unnecessary here.
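
To make the difference between the two options concrete, here is a minimal Python sketch that only assembles the two command lines. The helper names, the application jar `my-app.jar`, the local jar paths, and the versions `2.11` / `2.4.0` are illustrative assumptions; the guide itself fills in the versions from `{{site.SCALA_BINARY_VERSION}}` and `{{site.SPARK_VERSION_SHORT}}`.

```python
def avro_package(scala_binary_version, spark_version):
    """Maven coordinate in the groupId:artifactId:version form taken by --packages."""
    return f"org.apache.spark:spark-avro_{scala_binary_version}:{spark_version}"


def submit_with_packages(app, scala_binary_version, spark_version):
    # --packages resolves the coordinate together with its transitive dependencies.
    return [
        "./bin/spark-submit",
        "--packages",
        avro_package(scala_binary_version, spark_version),
        app,
    ]


def submit_with_jars(app, jar_paths):
    # --jars takes a comma-separated list of jar files; nothing is resolved,
    # so every required jar has to be listed explicitly.
    return ["./bin/spark-submit", "--jars", ",".join(jar_paths), app]


# Hypothetical values, for illustration only.
print(" ".join(submit_with_packages("my-app.jar", "2.11", "2.4.0")))
print(" ".join(submit_with_jars("my-app.jar", ["libs/spark-avro.jar", "libs/avro.jar"])))
```

The sketch shows why `--packages` is the simpler default for most users: a single coordinate stands in for the whole list of jars that `--jars` would need.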


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210970616
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    --- End diff --
    
    `support` -> `built-in support`


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95105 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95105/testReport)** for PR 22121 at commit [`8da8250`](https://github.com/apache/spark/commit/8da82506e06e36d63bf91fdda194a866f2d977ea).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211940709
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*;
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load();
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +Dataset<Row> output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    --- End diff --
    
    cc @arunmahadevan 


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95099/testReport)** for PR 22121 at commit [`d2681ec`](https://github.com/apache/spark/commit/d2681ec51a7dbc0296800cdbedb3d46827bf2b6f).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)** for PR 22121 at commit [`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95116/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94905/
    Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r212027641
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    --- End diff --
    
    I think it should be OK. The SQL programming guide uses "currently" in many places. Otherwise we would have to update the `2.4` for each release. (Is there any way to get the release version in the doc?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)** for PR 22121 at commit [`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95139/
    Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210922151
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1482,6 +1482,9 @@ SELECT * FROM resultTable
     </div>
     </div>
     
    +## AVRO Files
    +See the [AVRO data source guide](avro-data-source-guide.html).
    --- End diff --
    
    Nit: I think it's just called "Avro", and we should call it "Apache Avro" here.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95116/testReport)** for PR 22121 at commit [`581b7e6`](https://github.com/apache/spark/commit/581b7e60e70deac79a15e0a903a78deb10d4f7ac).


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211984616
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    --- End diff --
    
    not "Spark SQL", it should be "The Avro package"


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210917903
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in 
    --- End diff --
    
    I think this should be higher up, not in the examples section. Perhaps in its own compatibility section.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211988217
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    --- End diff --
    
    Nit: calling the `String(byte[])` constructor is usually a bad idea, as it interprets the bytes according to whatever the platform default encoding is. Add `StandardCharsets.UTF_8` as a second arg, though I don't know if this is too picky to care about in the example.
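
For illustration, the suggestion above could look roughly like the following. This is a minimal sketch, not the code that was merged: the class name `SchemaReader`, the helper `readUtf8`, and the temporary file standing in for `user.avsc` are all hypothetical. Only `Files.readAllBytes` and the `String(byte[], Charset)` constructor come from the JDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SchemaReader {
    // Hypothetical helper: decode the file's bytes as UTF-8 explicitly,
    // instead of relying on the platform default charset that the
    // single-argument String(byte[]) constructor would use.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for user.avsc (assumption for this sketch).
        Path tmp = Files.createTempFile("user", ".avsc");
        try {
            String schema = "{\"type\": \"string\", \"doc\": \"caf\u00e9\"}";
            Files.write(tmp, schema.getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

With the explicit charset, a schema containing non-ASCII characters decodes the same way regardless of the JVM's default encoding setting.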


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94859/
    Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211987726
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro`(or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
    +
    +Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +DataFrame output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro(col("user.name")).as("value"))
    +
    +StreamingQuery ds = output
    +  .writeStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Option
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by an user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read.<br> If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option allows to specify a compression codec used in write.<br>
    +  Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>.<br> If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> config is taken into account.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Configuration
    +Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td>spark.sql.legacy.replaceDatabricksSparkAvro.enabled</td>
    +    <td>true</td>
    +    <td>If it is set to true, the data source provider <code>com.databricks.spark.avro</code> is mapped to the built-in but external Avro data source module for backward compatibility.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.compression.codec</td>
    +    <td>snappy</td>
    +    <td>Compression codec used in writing of AVRO files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. Default codec is snappy.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.deflate.level</td>
    +    <td>-1</td>
    +    <td>Compression level for the deflate codec used in writing of AVRO files. Valid value must be in the range of from 1 to 9 inclusive or -1. The default value is -1 which corresponds to 6 level in the current implementation.</td>
    +  </tr>
    +</table>
    +
    +## Compatibility with Databricks spark-avro
    +This Avro data source module is originally from and compatible with Databricks's open source repository 
    +[spark-avro](https://github.com/databricks/spark-avro).
    +
    +By default with the SQL configuration `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` enabled, the data source provider `com.databricks.spark.avro` is 
    +mapped to this built-in Avro module. For the Spark tables created with `Provider` property as `com.databricks.spark.avro` in 
    +catalog meta store, the mapping is essential to load these tables if you are using this built-in Avro module. 
    +
    +Note in Databricks's [spark-avro](https://github.com/databricks/spark-avro), implicit classes 
    +`AvroDataFrameWriter` and `AvroDataFrameReader` were created for shortcut function `.avro()`. In this 
    +built-in but external module, both implicit classes are removed. Please use `.format("avro")` in 
    +`DataFrameWriter` or `DataFrameReader` instead, which should be clean and good enough.
    +
    +If you prefer using your own build of `spark-avro` jar file, you can simply disable the configuration 
    +`spark.sql.legacy.replaceDatabricksSparkAvro.enabled`, and use the option `--jars` on deploying your 
    +applications. Read the [Advanced Dependency Management](https://spark.apache
    +.org/docs/latest/submitting-applications.html#advanced-dependency-management) section in Application 
    +Submission Guide for more details. 
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    +<table class="table">
    +  <tr><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>boolean</td>
    +    <td>BooleanType</td>
    +  </tr>
    +  <tr>
    +    <td>int</td>
    +    <td>IntegerType</td>
    --- End diff --
    
    Byte and Short both map to Avro int, right?


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)** for PR 22121 at commit [`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0).


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    @cloud-fan @gatorsmile 


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210922729
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in 
    +<code>DataFrameReader</code> or <code>DataFrameWriter</code>.
    +To load/save data in Avro format, you need to specify the data source option <code>format</code> as short name <code>avro</code> or full name <code>org.apache.spark.sql.avro</code>.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Configuration
    --- End diff --
    
    Add a space (blank line) after headings like this.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    We also need to document the extra enhancements that are added in this release, compared with the databricks/spark-avro package. 


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95113/
    Test PASSed.


---

[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211081684
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since `spark-avro` module is external, there is not such API as `.avro` in 
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by a user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read. If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option specifies the compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> is taken into account.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    +<table class="table">
    +  <tr><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>boolean</td>
    +    <td>BooleanType</td>
    +  </tr>
    +  <tr>
    +    <td>int</td>
    +    <td>IntegerType</td>
    +  </tr>
    +  <tr>
    +    <td>long</td>
    +    <td>LongType</td>
    +  </tr>
    +  <tr>
    +    <td>float</td>
    +    <td>FloatType</td>
    +  </tr>
    +  <tr>
    +    <td>double</td>
    +    <td>DoubleType</td>
    +  </tr>
    +  <tr>
    +    <td>string</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>enum</td>
    +    <td>StringType</td>
    +  </tr>
    +  <tr>
    +    <td>fixed</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>bytes</td>
    +    <td>BinaryType</td>
    +  </tr>
    +  <tr>
    +    <td>record</td>
    +    <td>StructType</td>
    +  </tr>
    +  <tr>
    +    <td>array</td>
    +    <td>ArrayType</td>
    +  </tr>
    +  <tr>
    +    <td>map</td>
    +    <td>MapType</td>
    +  </tr>
    +  <tr>
    +    <td>union</td>
    +    <td>See below</td>
    +  </tr>
    +</table>
    +
    +In addition to the types listed above, it supports reading `union` types. The following three types are considered basic `union` types:
    +
    +1. `union(int, long)` will be mapped to LongType.
    +2. `union(float, double)` will be mapped to DoubleType.
    +3. `union(something, null)`, where something is any supported Avro type. This will be mapped to the same Spark SQL type as that of something, with nullable set to true.
    +All other union types are considered complex. They will be mapped to StructType where field names are member0, member1, etc., in accordance with members of the union. This is consistent with the behavior when converting between Avro and Parquet.
    +
    +It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types):
    +
    +<table class="table">
    +  <tr><th><b>Avro logical type</b></th><th><b>Avro type</b></th><th><b>Spark SQL type</b></th></tr>
    +  <tr>
    +    <td>date</td>
    +    <td>int</td>
    +    <td>DateType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-millis</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>timestamp-micros</td>
    +    <td>long</td>
    +    <td>TimestampType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    +  <tr>
    +    <td>decimal</td>
    +    <td>bytes</td>
    +    <td>DecimalType</td>
    +  </tr>
    --- End diff --
    
    It is a mistake. Thanks for pointing out!
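
As a side note on the `union` rules in the diff quoted above, here is a minimal sketch of how those three "basic" cases and the complex fallback could be expressed. It is only an illustration of the rules as written in the guide, not Spark's actual converter: `map_union` and `PRIMITIVE_TO_SPARK` are hypothetical names, Spark SQL types are returned as plain strings, only primitive members are handled, and stripping `null` before applying rules 1 and 2 is an assumption of this sketch.

```python
from typing import List, Tuple

# Avro primitive name -> Spark SQL type name, as listed in the quoted table.
PRIMITIVE_TO_SPARK = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "string": "StringType",
    "bytes": "BinaryType",
}


def map_union(members: List[str]) -> Tuple[str, bool]:
    """Hypothetical helper: return (Spark SQL type name, nullable) for an Avro union."""
    non_null = [m for m in members if m != "null"]
    # Assumption: the flag only records whether `null` is a member of the union.
    nullable = "null" in members
    kinds = set(non_null)

    if len(non_null) == 1:
        # Rule 3: union(something, null) keeps the type of `something`.
        spark_type = PRIMITIVE_TO_SPARK[non_null[0]]
    elif kinds == {"int", "long"}:
        # Rule 1: union(int, long) widens to LongType.
        spark_type = "LongType"
    elif kinds == {"float", "double"}:
        # Rule 2: union(float, double) widens to DoubleType.
        spark_type = "DoubleType"
    else:
        # Complex union: a struct with fields member0, member1, ...
        fields = ", ".join(
            f"member{i}: {PRIMITIVE_TO_SPARK[m]}" for i, m in enumerate(non_null)
        )
        spark_type = f"StructType({fields})"
    return spark_type, nullable
```

For example, `map_union(["string", "null"])` follows rule 3, while `map_union(["int", "string"])` falls through to the struct case with fields `member0` and `member1`.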


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by dhruve <gi...@git.apache.org>.

Github user dhruve commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r212075677
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark application, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be added directly to `spark-submit` using `--packages`, for example,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since the `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro` (or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides the function `to_avro()` to encode a column as binary in Avro format, and `from_avro()` to decode Avro binary data into a column.
    +
    +Using Avro records as columns is useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import java.nio.file.{Files, Paths}
    +
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +
    +import static org.apache.spark.sql.functions.col;
    +import org.apache.spark.sql.avro.*;
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load();
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +Dataset<Row> output = df
    +  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro(col("user.name")).as("value"));
    +
    +StreamingQuery ds = output
    +  .writeStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start();
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Option
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by a user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top level record name in write result, which is required in Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>The option controls ignoring of files without <code>.avro</code> extensions in read.<br> If the option is enabled, all files (with and without <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option specifies the compression codec used in write.<br>
    +  Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>.<br> If the option is not set, the configuration <code>spark.sql.avro.compression.codec</code> is taken into account.</td>
    +    <td>write</td>
    +  </tr>
    +</table>
    +
    +## Configuration
    +Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td>spark.sql.legacy.replaceDatabricksSparkAvro.enabled</td>
    +    <td>true</td>
    +    <td>If it is set to true, the data source provider <code>com.databricks.spark.avro</code> is mapped to the built-in but external Avro data source module for backward compatibility.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.compression.codec</td>
    +    <td>snappy</td>
    +    <td>Compression codec used in writing of AVRO files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. Default codec is snappy.</td>
    +  </tr>
    +  <tr>
    +    <td>spark.sql.avro.deflate.level</td>
    +    <td>-1</td>
    +    <td>Compression level for the deflate codec used in writing of AVRO files. Valid values are 1 to 9 inclusive, or -1. The default value is -1, which corresponds to level 6 in the current implementation.</td>
    +  </tr>
    +</table>
    +
    +## Compatibility with Databricks spark-avro
    +This Avro data source module is originally from and compatible with Databricks's open source repository 
    +[spark-avro](https://github.com/databricks/spark-avro).
    +
    +By default with the SQL configuration `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` enabled, the data source provider `com.databricks.spark.avro` is 
    +mapped to this built-in Avro module. For the Spark tables created with `Provider` property as `com.databricks.spark.avro` in 
    +catalog meta store, the mapping is essential to load these tables if you are using this built-in Avro module. 
    +
    +Note in Databricks's [spark-avro](https://github.com/databricks/spark-avro), implicit classes 
    +`AvroDataFrameWriter` and `AvroDataFrameReader` were created for shortcut function `.avro()`. In this 
    +built-in but external module, both implicit classes are removed. Please use `.format("avro")` in 
    +`DataFrameWriter` or `DataFrameReader` instead, which should be clean and good enough.
    +
    +If you prefer using your own build of `spark-avro` jar file, you can simply disable the configuration 
    +`spark.sql.legacy.replaceDatabricksSparkAvro.enabled`, and use the option `--jars` on deploying your 
    +applications. Read the [Advanced Dependency Management](https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management) section in Application 
    +Submission Guide for more details. 
    +
    +## Supported types for Avro -> Spark SQL conversion
    +Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
    --- End diff --
    
    @gengliangwang ^^
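
On the `compression` option and the `spark.sql.avro.deflate.level` row in the tables quoted above, the following is a minimal sketch of the fallback and validation rules as the guide words them. It is not Spark's implementation: `resolve_codec` and `effective_deflate_level` are hypothetical helpers, and plain dictionaries stand in for the writer options and the SQL configuration.

```python
# Codecs listed as supported in the quoted guide.
SUPPORTED_CODECS = ("uncompressed", "snappy", "deflate", "bzip2", "xz")


def resolve_codec(options: dict, conf: dict) -> str:
    """Hypothetical helper: the write option wins, then the SQL config, then snappy."""
    codec = options.get("compression")
    if codec is None:
        # Option not set: fall back to the configuration, whose default is snappy.
        codec = conf.get("spark.sql.avro.compression.codec", "snappy")
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    return codec


def effective_deflate_level(conf: dict) -> int:
    """Hypothetical helper: valid levels are 1 to 9 inclusive, or -1 (treated as 6)."""
    level = int(conf.get("spark.sql.avro.deflate.level", -1))
    if level == -1:
        # The guide states that -1 corresponds to level 6 in the current implementation.
        return 6
    if 1 <= level <= 9:
        return level
    raise ValueError(f"invalid deflate level: {level}")
```

With empty dictionaries both helpers return the documented defaults (`snappy` and level 6); an explicit `compression` option takes precedence over the configuration value.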


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94886/testReport)** for PR 22121 at commit [`72c8ef2`](https://github.com/apache/spark/commit/72c8ef21d966ff2b2471a998323fd7b24278c12f).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95099 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95099/testReport)** for PR 22121 at commit [`d2681ec`](https://github.com/apache/spark/commit/d2681ec51a7dbc0296800cdbedb3d46827bf2b6f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2474/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94886/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211987834
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,377 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark application, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be added directly to `spark-submit` using `--packages`, for example,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load and Save Functions
    +
    +Since the `spark-avro` module is external, there is no `.avro` API in 
    +`DataFrameReader` or `DataFrameWriter`.
    +
    +To load/save data in Avro format, you need to specify the data source option `format` as `avro` (or `org.apache.spark.sql.avro`).
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## to_avro() and from_avro()
    +Spark SQL provides the function `to_avro()` to encode a column as binary in Avro format, and `from_avro()` to decode Avro binary data into a column.
    +
    +Using Avro records as columns is useful when reading from or writing to a streaming source like Kafka. Each 
    +Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
    +* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
    +* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
    +
    +Both methods are presently only available in Scala and Java.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import java.nio.file.{Files, Paths}
    +
    +import org.apache.spark.sql.avro._
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +val df = spark
    +  .readStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +val output = df
    +  .select(from_avro('value, jsonFormatSchema) as 'user)
    +  .where("user.favorite_color == \"red\"")
    +  .select(to_avro($"user.name") as 'value)
    +
    +val ds = output
    +  .writeStream
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("topic", "topic2")
    +  .start()
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import org.apache.spark.sql.avro.*
    +
    +// `from_avro` requires Avro schema in JSON string format.
    +String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
    +
    +Dataset<Row> df = spark
    +  .readStream()
    +  .format("kafka")
    +  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    +  .option("subscribe", "topic1")
    +  .load()
    +
    +// 1. Decode the Avro data into a struct;
    +// 2. Filter by column `favorite_color`;
    +// 3. Encode the column `name` in Avro format.
    +DataFrame output = df
    --- End diff --
    
    Looks OK except missing a semicolon at the end of the statements.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2445/
    Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95113/testReport)** for PR 22121 at commit [`006ea40`](https://github.com/apache/spark/commit/006ea40ce0d7a3939241c6e0126732e9cebb59ca).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)** for PR 22121 at commit [`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210922376
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    --- End diff --
    
    Call it "Apache Avro" in the title and first mention in the paragraph below. Afterwards, just "Avro" is OK.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #95140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)** for PR 22121 at commit [`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    **[Test build #94905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94905/testReport)** for PR 22121 at commit [`ff6d3ab`](https://github.com/apache/spark/commit/ff6d3abf1b1f4dec6ec29266a325a2f7bd4fdd05).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r211126239
  
    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,260 @@
    +---
    +layout: global
    +title: Apache Avro Data Source Guide
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
    +
    +## Deploying
    +The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark application, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be added directly to `spark-submit` using `--packages`, for example,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Load/Save Functions
    +
    +Since the `spark-avro` module is external, there is no `.avro` API in
    +`DataFrameReader` or `DataFrameWriter`.
    +To load/save data in Avro format, you need to specify the data source option `format` as the short name `avro` or the full name `org.apache.spark.sql.avro`.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Data Source Options
    +
    +Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
    +  <tr>
    +    <td><code>avroSchema</code></td>
    +    <td>None</td>
    +    <td>Optional Avro schema provided by a user in JSON format.</td>
    +    <td>read and write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordName</code></td>
    +    <td>topLevelRecord</td>
    +    <td>Top-level record name in the write result, which is required by the Avro spec.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>recordNamespace</code></td>
    +    <td>""</td>
    +    <td>Record namespace in write result.</td>
    +    <td>write</td>
    +  </tr>
    +  <tr>
    +    <td><code>ignoreExtension</code></td>
    +    <td>true</td>
    +    <td>Controls whether the <code>.avro</code> file extension is ignored in read. If the option is enabled, all files (with and without the <code>.avro</code> extension) are loaded.</td>
    +    <td>read</td>
    +  </tr>
    +  <tr>
    +    <td><code>compression</code></td>
    +    <td>snappy</td>
    +    <td>The <code>compression</code> option specifies the compression codec used in write. Currently supported codecs are <code>uncompressed</code>, <code>snappy</code>, <code>deflate</code>, <code>bzip2</code> and <code>xz</code>. If the option is not set, the <code>spark.sql.avro.compression.codec</code> configuration is taken into account.</td>
    --- End diff --
    
    @gengliangwang, I could check this myself, but I thought it would be easier to ask you. Do we now have all the options and configurations that exist in spark-avro?
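
For reference, the options in the quoted table could be passed to a reader or writer roughly as follows. This is a minimal sketch, not part of the PR: the helper `avro_options` is hypothetical, and the option names and scopes are copied from the table as it stands at this commit, so they may not match the final guide.

```python
import json

# Option names and scopes as listed in the table in the quoted diff
# (assumption: taken from this WIP commit, not from released Spark docs).
READ_OPTIONS = {"avroSchema", "ignoreExtension"}
WRITE_OPTIONS = {"avroSchema", "recordName", "recordNamespace", "compression"}


def avro_options(scope, **options):
    """Hypothetical helper: check that each option applies to the given
    scope ("read" or "write") and return a dict of string values."""
    if scope not in ("read", "write"):
        raise ValueError(f"scope must be 'read' or 'write', got {scope!r}")
    allowed = READ_OPTIONS if scope == "read" else WRITE_OPTIONS
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"not a {scope} option: {sorted(unknown)}")
    result = {}
    for name, value in options.items():
        if isinstance(value, dict):
            # An Avro schema given as a dict is serialized to JSON text.
            result[name] = json.dumps(value)
        elif isinstance(value, bool):
            result[name] = str(value).lower()
        else:
            result[name] = str(value)
    return result


# Usage, assuming a SparkSession `spark` started with the spark-avro package:
# read_opts = avro_options("read", ignoreExtension=True)
# df = spark.read.format("avro").options(**read_opts).load("examples/src/main/resources/users.avro")
# write_opts = avro_options("write", compression="deflate")
# df.write.format("avro").options(**write_opts).save("namesAndFavColors.avro")
```

Keeping the scope check outside Spark makes it easy to see which options from the table apply to reads and which to writes.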


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2452/


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22121
  
    Merged build finished. Test PASSed.


---
