You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by arunmahadevan <gi...@git.apache.org> on 2018/08/28 08:04:12 UTC

[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

GitHub user arunmahadevan opened a pull request:

    https://github.com/apache/spark/pull/22251

    [SPARK-25260][SQL] Fix namespace handling in SchemaConverters.toAvroType

    ## What changes were proposed in this pull request?
    
    `toAvroType` converts spark data type to avro schema. It always appends the record name to namespace so its impossible to have an Avro namespace independent of the record name.
    
     
    When invoked with a spark data type like,
    
    ```java
    val sparkSchema = StructType(Seq(
        StructField("name", StringType, nullable = false),
        StructField("address", StructType(Seq(
            StructField("city", StringType, nullable = false),
            StructField("state", StringType, nullable = false))),
        nullable = false)))
     
    // map it to an avro schema with record name "employee" and top level namespace "foo.bar",
    val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", "foo.bar")
    
    // result is
    // avroSchema.getName = employee
    // avroSchema.getNamespace = foo.bar.employee
    // avroSchema.getFullname = foo.bar.employee.employee
    ```
    The patch proposes to fix this so that the result is
    
    ```
    avroSchema.getName = employee
    avroSchema.getNamespace = foo.bar
    avroSchema.getFullname = foo.bar.employee
    ```
    ## How was this patch tested?
    
    New and existing unit tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arunmahadevan/spark avro-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22251
    
----
commit f47483951e12d563b7696940a2cfc2fdc3b27ab2
Author: Arun Mahadevan <ar...@...>
Date:   2018-08-28T08:00:17Z

    [SPARK-25260][SQL] Fix namespace handling in SchemaConverters.toAvroType

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    **[Test build #95367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95367/testReport)** for PR 22251 at commit [`2153428`](https://github.com/apache/spark/commit/2153428f62dca7d462d17c85576426ecca52d1cb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22251#discussion_r213232867
  
    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -1099,6 +1098,27 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("check namespace - toAvroType") {
    --- End diff --
    
    @arunmahadevan, can we add a simple end-to-end test as well?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95338/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    **[Test build #95367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95367/testReport)** for PR 22251 at commit [`2153428`](https://github.com/apache/spark/commit/2153428f62dca7d462d17c85576426ecca52d1cb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22251


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    **[Test build #95338 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95338/testReport)** for PR 22251 at commit [`f474839`](https://github.com/apache/spark/commit/f47483951e12d563b7696940a2cfc2fdc3b27ab2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22251#discussion_r213231887
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala ---
    @@ -143,29 +143,25 @@ object SchemaConverters {
             val avroType = LogicalTypes.decimal(d.precision, d.scale)
             val fixedSize = minBytesForPrecision(d.precision)
             // Need to avoid naming conflict for the fixed fields
    -        val name = prevNameSpace match {
    +        val name = nameSpace match {
               case "" => s"$recordName.fixed"
    -          case _ => s"$prevNameSpace.$recordName.fixed"
    +          case _ => s"$nameSpace.$recordName.fixed"
             }
             avroType.addToSchema(SchemaBuilder.fixed(name).size(fixedSize))
     
           case BinaryType => builder.bytesType()
           case ArrayType(et, containsNull) =>
             builder.array()
    -          .items(toAvroType(et, containsNull, recordName, prevNameSpace))
    +          .items(toAvroType(et, containsNull, recordName, nameSpace))
           case MapType(StringType, vt, valueContainsNull) =>
             builder.map()
    -          .values(toAvroType(vt, valueContainsNull, recordName, prevNameSpace))
    +          .values(toAvroType(vt, valueContainsNull, recordName, nameSpace))
           case st: StructType =>
    -        val nameSpace = prevNameSpace match {
    -          case "" => recordName
    -          case _ => s"$prevNameSpace.$recordName"
    -        }
    -
    +        val childNameSpace = if (nameSpace != "") s"$nameSpace.$recordName" else recordName
             val fieldsAssembler = builder.record(recordName).namespace(nameSpace).fields()
    --- End diff --
    
    +1, this line is the only difference for the whole code change. The namespace here should not be the one with `recordName` at the end.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22251#discussion_r213407441
  
    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -1099,6 +1098,27 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("check namespace - toAvroType") {
    --- End diff --
    
    Its sort of covered in the below existing cases. Do you think we need more?
    
    [Validate namespace in avro file that has nested records with the same name](https://github.com/apache/spark/blob/master/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala#L1078)
    [conversion to avro and back with namespace](https://github.com/apache/spark/blob/master/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala#L510)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    cc @gengliangwang @dongjoon-hyun


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95367/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22251#discussion_r213233392
  
    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -1099,6 +1098,27 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("check namespace - toAvroType") {
    +    val sparkSchema = StructType(Seq(
    +      StructField("name", StringType, nullable = false),
    +      StructField("address", StructType(Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("state", StringType, nullable = false))),
    +        nullable = false)))
    +    val employeeType = SchemaConverters.toAvroType(sparkSchema,
    +      recordName = "employee",
    +      nameSpace = "foo.bar")
    --- End diff --
    
    nit: could you also add a case for `nameSpace` as `""` ? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22251: [SPARK-25260][SQL] Fix namespace handling in Sche...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22251#discussion_r213407599
  
    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -1099,6 +1098,27 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
         }
       }
     
    +  test("check namespace - toAvroType") {
    +    val sparkSchema = StructType(Seq(
    +      StructField("name", StringType, nullable = false),
    +      StructField("address", StructType(Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("state", StringType, nullable = false))),
    +        nullable = false)))
    +    val employeeType = SchemaConverters.toAvroType(sparkSchema,
    +      recordName = "employee",
    +      nameSpace = "foo.bar")
    --- End diff --
    
    Added a test case for toAvroType with empty namespace


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    **[Test build #95338 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95338/testReport)** for PR 22251 at commit [`f474839`](https://github.com/apache/spark/commit/f47483951e12d563b7696940a2cfc2fdc3b27ab2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22251: [SPARK-25260][SQL] Fix namespace handling in SchemaConve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22251
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org