You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/10 10:31:23 UTC

[GitHub] [spark] uzadude opened a new pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

uzadude opened a new pull request #31543:
URL: https://github.com/apache/spark/pull/31543


   ### What changes were proposed in this pull request?
   
   Added option to provide Avro schema by URL.
   
   ### Why are the changes needed?
   (copied from Jira ticket)
   
   We have a use case in which we read a huge table in Avro format. About 30k columns.
   
   using the default Hive reader - `AvroGenericRecordReader` it is just hangs forever. after 4 hours not even one task has finished.
   
   We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)` but we failed on:
   
   ```
   
   org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
   
   ..
   
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
   ... 53 elided
   
   ```
   
    
   
   because files schema contain duplicate column names (when considering case-insensitive).
   
   So we wanted to provide a user schema with non-duplicated fields, but the schema is huge. a few MBs. it is not practical to provide it in json format.
   
    
   
   So we patched spark-avro to be able to get also `avroSchemaUrl` in addition to `avroSchema` and it worked perfectly.
   
   
   ### How was this patch tested?
   added a unitest to AvroSuite and tested locally with patched version
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575764801



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+

Review comment:
       Could you remove this empty line?

##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    parameters.get("avroSchema").map(new Schema.Parser().parse).orElse({
+

Review comment:
       ditto.

##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    parameters.get("avroSchema").map(new Schema.Parser().parse).orElse({
+
+      val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {
+        log.info("loading avro schema from url: " + schemaFSUrl)
+        val fs = FileSystem.get(new URI(schemaFSUrl), conf)
+        val in = fs.open(new Path(schemaFSUrl))
+        try {
+          new Schema.Parser().parse(in)
+        } finally {
+          in.close()
+        }
+      })
+

Review comment:
       ditto.

##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    parameters.get("avroSchema").map(new Schema.Parser().parse).orElse({
+
+      val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {
+        log.info("loading avro schema from url: " + schemaFSUrl)
+        val fs = FileSystem.get(new URI(schemaFSUrl), conf)
+        val in = fs.open(new Path(schemaFSUrl))
+        try {
+          new Schema.Parser().parse(in)
+        } finally {
+          in.close()
+        }
+      })
+
+      avroUrlSchema
+    })
+

Review comment:
       ditto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] uzadude removed a comment on pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
uzadude removed a comment on pull request #31543:
URL: https://github.com/apache/spark/pull/31543#issuecomment-778616988


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575765191



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    parameters.get("avroSchema").map(new Schema.Parser().parse).orElse({
+
+      val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {

Review comment:
       `schemaFSUrl` -> `url`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575765283



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,25 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    parameters.get("avroSchema").map(new Schema.Parser().parse).orElse({
+
+      val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {
+        log.info("loading avro schema from url: " + schemaFSUrl)

Review comment:
       Shall we lower the log level to `debug`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #31543: [SPARK-34416][SQL] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #31543:
URL: https://github.com/apache/spark/pull/31543


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] uzadude commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
uzadude commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575624357



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,21 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {
+      log.info("loading avro schema from url: " + schemaFSUrl)
+      val fs = FileSystem.get(new URI(schemaFSUrl), conf)
+      val in = fs.open(new Path(schemaFSUrl))

Review comment:
       1. this is the behavior when using Avro Hive table. see `org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException()`. so now when we're trying to move users from the Avro table to `spark.read.` they have a regression.
   2. some of the users use mainly sql code. it will be cumbersome for them to write this logic every time.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] uzadude commented on pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
uzadude commented on pull request #31543:
URL: https://github.com/apache/spark/pull/31543#issuecomment-778566917


   @dongjoon-hyun - agree, done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575765476



##########
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##########
@@ -725,6 +725,16 @@ abstract class AvroSuite
     assert(result.sameElements(expected))
   }
 
+  test("support user provided avro schema url") {
+    val avroSchemaUrl = testFile("test_sub.avsc")
+    val result = spark.read.option("avroSchemaUrl", avroSchemaUrl)
+      .format("avro")
+      .load(testAvro)
+      .collect()
+    val expected = spark.read.format("avro").load(testAvro).select("string").collect()
+    assert(result.sameElements(expected))
+  }

Review comment:
       Could you add more test cases? For example,
   1. A test coverage for both `avroSchema` and `avroSchemaUrl` exist.
   2. A negative test case for a wrong `avroSchemaUrl`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575620933



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
##########
@@ -47,7 +51,21 @@ private[sql] class AvroOptions(
    * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
    * instead of "string" type in the default converted schema.
    */
-  val schema: Option[String] = parameters.get("avroSchema")
+  val schema: Option[Schema] = {
+
+    val avroUrlSchema = parameters.get("avroSchemaUrl").map(schemaFSUrl => {
+      log.info("loading avro schema from url: " + schemaFSUrl)
+      val fs = FileSystem.get(new URI(schemaFSUrl), conf)
+      val in = fs.open(new Path(schemaFSUrl))

Review comment:
       I feel like this is just a small code bit that can be just done in user's application side.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] uzadude commented on pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
uzadude commented on pull request #31543:
URL: https://github.com/apache/spark/pull/31543#issuecomment-778616988


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575764900



##########
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##########
@@ -725,6 +725,16 @@ abstract class AvroSuite
     assert(result.sameElements(expected))
   }
 
+  test("support user provided avro schema url") {

Review comment:
       Please add a test name prefix, `SPARK-34416: `.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31543:
URL: https://github.com/apache/spark/pull/31543#issuecomment-776611215


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31543:
URL: https://github.com/apache/spark/pull/31543#discussion_r575764991



##########
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##########
@@ -702,7 +702,7 @@ abstract class AvroSuite
     }
   }
 
-  test("support user provided avro schema") {
+  test("support user provided avro schema string") {

Review comment:
       I guess it's not required to change this test case name.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31543: [SPARK-34416] Adding support for user provided schema url in Avro

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31543:
URL: https://github.com/apache/spark/pull/31543#issuecomment-778480557


   Also, cc @gengliangwang .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org