Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/01 07:49:28 UTC

[GitHub] [incubator-hudi] AakashPradeep opened a new pull request #1580: adding check for table name for Append Save mode HUDI-852

AakashPradeep opened a new pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *Adding a check to validate that the table name in the request matches the table name in the metadata at the path when the save mode is Append*
   
   ## Brief change log
   
     - *Added a check in HoodieSparkSqlWriter that rejects SaveMode.Append writes when the requested table name does not match the name stored in the table metadata at the path*
   
   ## Verify this pull request
   
   This change was verified manually with spark-shell: appending to an existing table with a mismatched table name now throws a HoodieException, while appending with the matching name succeeds (see the transcripts later in this thread).
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r419654885



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -83,6 +83,13 @@ private[hudi] object HoodieSparkSqlWriter {
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
     var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))
 
+    if (exists && mode == SaveMode.Append) {
+      val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get).getTableConfig.getTableName
+      if (!existingTableName.equals(tblName.get)) {
+        throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")

Review comment:
    Actually, we could have pushed this check into the guts of `hoodie-client`. Every write initializes a HoodieTableMetaClient anyway, so having it error out from inside would save this extra tableMetaClient initialization. @bhasudha let's file a follow-up, if you agree.
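
    A rough sketch of that idea, written in Scala for brevity (Hudi's client code is Java, and validateTableName is a hypothetical helper, not an existing API): the meta client that every write builds anyway could compare its stored table name against the configured one itself, so the writer would not need a second HoodieTableMetaClient just for validation.

      import org.apache.hudi.common.table.HoodieTableMetaClient
      import org.apache.hudi.exception.HoodieException

      // Hypothetical helper inside hoodie-client: reuse the meta client that the
      // write path constructs anyway, instead of building a second one to validate.
      def validateTableName(metaClient: HoodieTableMetaClient, expectedName: String): Unit = {
        val existing = metaClient.getTableConfig.getTableName
        if (existing != expectedName) {
          throw new HoodieException(
            s"hoodie table with name $existing already exist at ${metaClient.getBasePath}")
        }
      }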




----------------------------------------------------------------



[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r420174060



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -83,6 +83,13 @@ private[hudi] object HoodieSparkSqlWriter {
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
     var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))
 
+    if (exists && mode == SaveMode.Append) {
+      val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get).getTableConfig.getTableName
+      if (!existingTableName.equals(tblName.get)) {
+        throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")

Review comment:
    @AakashPradeep Thanks for the follow-up.

    For other save modes, e.g. OVERWRITE (where we first delete the existing dataset), this does not apply anyway. So I was wondering if we can enforce this at the `HoodieTable.create()` level, which always reads `hoodie.properties` anyway. We could add a `ValidationUtils.checkArgument()` there to simply check that the writeConfig table name matches what we read from the props file.

    I will let you and @bhasudha take it from here.. :)
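
    A minimal sketch of that suggestion, assuming ValidationUtils.checkArgument(boolean, String) is available and that HoodieTable.create() has the write config and the freshly read table config in scope; the helper name and variable names are illustrative, and the sketch is Scala although HoodieTable itself is Java:

      import org.apache.hudi.common.table.HoodieTableConfig
      import org.apache.hudi.common.util.ValidationUtils
      import org.apache.hudi.config.HoodieWriteConfig

      // Check HoodieTable.create() could run right after reading hoodie.properties:
      // fail fast when the configured table name disagrees with the stored one.
      def checkTableName(writeConfig: HoodieWriteConfig, tableConfig: HoodieTableConfig): Unit =
        ValidationUtils.checkArgument(
          writeConfig.getTableName == tableConfig.getTableName,
          s"Table name '${writeConfig.getTableName}' in the write config does not match " +
            s"'${tableConfig.getTableName}' read from hoodie.properties")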




----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r419837606



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -83,6 +83,13 @@ private[hudi] object HoodieSparkSqlWriter {
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
     var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))
 
+    if (exists && mode == SaveMode.Append) {
+      val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get).getTableConfig.getTableName
+      if (!existingTableName.equals(tblName.get)) {
+        throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")

Review comment:
    Thanks for the comment @vinothchandar.

    I would suggest the following:

    1. I can use HoodieTableConfig here instead of HoodieTableMetaClient, which seems a little expensive here (a sketch follows below).

    2. I will explore the hoodie-client code, but I would suggest either keeping all the save-mode-based checks in this class or moving them all to hoodie-client. The earlier it throws an exception the better, but I will leave that decision to you.

    3. If we decide to keep all the checks as they are, I suggest moving the check at line 116 to the beginning of the if section so that we fail fast and avoid a lot of initialization. The same goes for the table existence check at line 172; it should move to the beginning of the else section.

    Please let me know if this sounds reasonable. I can file another Jira for the improvement.

    Thanks!
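
    For point 1, a sketch of the cheaper variant, reusing the fs, basePath, mode, and tblName values already in scope in HoodieSparkSqlWriter.write per the posted diff; the HoodieTableConfig(fs, metaPath) constructor is an assumption, so check it against the Hudi version at hand:

      import org.apache.hadoop.fs.Path
      import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
      import org.apache.hudi.exception.HoodieException

      if (exists && mode == SaveMode.Append) {
        // Read hoodie.properties directly instead of building a full meta client.
        val metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME).toString
        val existingTableName = new HoodieTableConfig(fs, metaPath).getTableName // constructor assumed
        if (!existingTableName.equals(tblName.get)) {
          throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")
        }
      }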




----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on pull request #1580: adding check for table name for Append Save mode HUDI-852

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-622318079


   I did manual verification with spark-shell. If I try to append data with a different table name, it throws a HoodieException:
   
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Overwrite).
        |   save(basePath)
   20/05/01 02:23:52 WARN HoodieSparkSqlWriter$: hoodie table at file:/tmp/hudi_trips_cow already exists. Deleting existing data & overwriting with new data.
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Append).
        |   save(basePath)
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:124)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 68 elided
   
   scala>


----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep edited a comment on pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep edited a comment on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-622318079


   I did manual verification with spark-shell. If I try to append data with a different table name, it throws a HoodieException:
   
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Overwrite).
        |   save(basePath)
   20/05/01 02:23:52 WARN HoodieSparkSqlWriter$: hoodie table at file:/tmp/hudi_trips_cow already exists. Deleting existing data & overwriting with new data.
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Append).
        |   save(basePath)
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:124)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 68 elided
   
   scala>
   
    scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
     |   option(OPERATION_OPT_KEY,"delete").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 69 elided
   
   scala>


----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep edited a comment on pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep edited a comment on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-623205984


   Now it checks the existing table name for both delete and upsert operations when the SaveMode is Append.
   
   exceptions
   ---
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 68 elided
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(OPERATION_OPT_KEY,"delete").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Append).
        |   save(basePath)
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(OPERATION_OPT_KEY,"delete").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 69 elided
   
   
   and works for a valid table name
   -------------
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Append).
        |   save(basePath)
   
   scala>


----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r419837606



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -83,6 +83,13 @@ private[hudi] object HoodieSparkSqlWriter {
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
     var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))
 
+    if (exists && mode == SaveMode.Append) {
+      val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get).getTableConfig.getTableName
+      if (!existingTableName.equals(tblName.get)) {
+        throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")

Review comment:
    Thanks for the comment @vinothchandar.

    I would suggest the following:

    1. I can use HoodieTableConfig here instead of HoodieTableMetaClient, which seems a little expensive here.

    2. I would suggest keeping this check here, since the rest of the SaveMode-based checks live in this class, and IMHO HoodieTableMetaClient is responsible for reading the metadata while the writer should make decisions based on it. Please correct me if I am wrong.

    3. I suggest moving the check at line 116 to the beginning of the if section so that we fail fast and avoid a lot of initialization. The same goes for the table existence check at line 172; it should move to the beginning of the else section.

    Please let me know if this sounds reasonable. I can file another Jira for the improvement.

    Thanks!




----------------------------------------------------------------



[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1580: adding check for table name for Append Save mode HUDI-852

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r418964519



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -118,6 +118,12 @@ private[hudi] object HoodieSparkSqlWriter {
         fs.delete(basePath, true)
         exists = false
       }
+      if (exists && mode == SaveMode.Append) {

Review comment:
    I think this check applies to both the if and the else clause. Can you verify that and move it to, say, line [85](https://github.com/apache/incubator-hudi/pull/1580/files#diff-5317f4121df875e406876f9f0f012facR85) (before the if clause)?
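
    That is what the diff at the top of this thread ends up doing; hoisted above the branch, the flow looks roughly like this (condensed from the posted diff, with the writer's existing branches elided):

      val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
      var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))

      // The name check now runs once, before the writer branches, so both the
      // if and the else clause below are covered by a single validation.
      if (exists && mode == SaveMode.Append) {
        val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get)
          .getTableConfig.getTableName
        if (!existingTableName.equals(tblName.get)) {
          throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")
        }
      }
      // ... the existing if/else with the write and delete paths follows unchanged ...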




----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r419180984



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -118,6 +118,12 @@ private[hudi] object HoodieSparkSqlWriter {
         fs.delete(basePath, true)
         exists = false
       }
+      if (exists && mode == SaveMode.Append) {

Review comment:
       @bhasudha please review! 
   
   Thanks!




----------------------------------------------------------------



[GitHub] [incubator-hudi] bhasudha commented on pull request #1580: adding check for table name for Append Save mode HUDI-852

Posted by GitBox <gi...@apache.org>.
bhasudha commented on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-622961553


   @AakashPradeep Also, can you change the PR title to start with [HUDI-852]?


----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-623205984


   Now it checks the existing table name for both delete and upsert operations when the SaveMode is Append.
   
   exceptions
   ---
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 68 elided
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(OPERATION_OPT_KEY,"delete").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   mode(Append).
        |   save(basePath)
   
   scala> df.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(OPERATION_OPT_KEY,"delete").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, "foo_table").
        |   mode(Append).
        |   save(basePath)
   org.apache.hudi.exception.HoodieException: hoodie table with name hudi_trips_cow already exist at file:/tmp/hudi_trips_cow
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
     ... 69 elided
   


----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on a change in pull request #1580: adding check for table name for Append Save mode HUDI-852

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r419177304



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -118,6 +118,12 @@ private[hudi] object HoodieSparkSqlWriter {
         fs.delete(basePath, true)
         exists = false
       }
+      if (exists && mode == SaveMode.Append) {

Review comment:
       Thanks for the comment. I have updated the code.




----------------------------------------------------------------



[GitHub] [incubator-hudi] AakashPradeep commented on a change in pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
AakashPradeep commented on a change in pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#discussion_r420282414



##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -83,6 +83,13 @@ private[hudi] object HoodieSparkSqlWriter {
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
     var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))
 
+    if (exists && mode == SaveMode.Append) {
+      val existingTableName = new HoodieTableMetaClient(sparkContext.hadoopConfiguration, path.get).getTableConfig.getTableName
+      if (!existingTableName.equals(tblName.get)) {
+        throw new HoodieException(s"hoodie table with name $existingTableName already exist at $basePath")

Review comment:
    Thanks!
   




----------------------------------------------------------------



[GitHub] [incubator-hudi] bhasudha commented on pull request #1580: [HUDI-852] adding check for table name for Append Save mode

Posted by GitBox <gi...@apache.org>.
bhasudha commented on pull request #1580:
URL: https://github.com/apache/incubator-hudi/pull/1580#issuecomment-623275628


   LGTM. Merging.


----------------------------------------------------------------