Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/18 18:40:27 UTC

[GitHub] [incubator-hudi] nsivabalan opened a new pull request #1248: Adding delete docs to QuickStart

nsivabalan opened a new pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248
 
 
   - Adding delete docs to QuickStart
   
   ## Verify this pull request
   This pull request is a trivial rework / code cleanup without any test coverage.
   Verified locally.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248589
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   Let's just delete a few existing records and show that? You can use `.limit(2)` to, say, get just 2 records out of the existing table and delete them.
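
   A rough sketch of what this suggestion could look like in the quickstart, not the final merged text. It assumes a spark-shell session started with the Hudi bundle, that `spark`, `dataGen`, `tableName` and `basePath` are already defined by the earlier quickstart steps, and that the quickstart `DataGenerator` helper exposes `generateDeletes` and the write options include a `"delete"` operation value (as I understand the 0.5.1 helpers). The names `beforeDF`, `toDelete`, `deleteDF` and `afterDF` are made up for illustration.

```scala
// Sketch only: run inside spark-shell with the Hudi bundle on the classpath.
// Assumes `spark`, `dataGen`, `tableName` and `basePath` come from earlier quickstart steps.
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import scala.collection.JavaConversions._

// Load the existing table and count the records before the delete.
val beforeDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
beforeDF.createOrReplaceTempView("hudi_ro_table")
val countBefore = spark.sql("select uuid, partitionpath from hudi_ro_table").count()

// Pick just two existing records to delete, as suggested with .limit(2).
val toDelete = spark.sql("select uuid, partitionpath from hudi_ro_table").limit(2)

// Build delete payloads for those keys (assumed helper) and write them
// with the delete operation in Append mode so the rest of the table is kept.
val deletes = dataGen.generateDeletes(toDelete.collectAsList())
val deleteDF = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
deleteDF.write.format("org.apache.hudi").
    options(getQuickstartWriteConfigs).
    option(OPERATION_OPT_KEY, "delete").
    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
    option(TABLE_NAME, tableName).
    mode(Append).
    save(basePath)

// Re-read the table and count again; the count should drop by two if the delete succeeded.
val afterDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
afterDF.createOrReplaceTempView("hudi_ro_table")
val countAfter = spark.sql("select uuid, partitionpath from hudi_ro_table").count()
```

   Deleting only two records this way avoids the extra insert batch and leaves the rest of the table intact for the later sections.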


[GitHub] [incubator-hudi] bhasudha commented on issue #1248: [Minor] Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1248: [Minor] Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-577525944
 
 
   > @bhasudha: can you check if things are looking good now?
   
   LGTM. Merging this.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368242317
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   Ideally, we just have this part and remove everything above this line, to keep the quickstart small.


[GitHub] [incubator-hudi] bhasudha commented on issue #1248: [Minor] Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1248: [Minor] Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-576830791
 
 
   @nsivabalan can you fix the conflict and update the PR again? I think it should be good to merge then. 


[GitHub] [incubator-hudi] vinothchandar commented on issue #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-575928283
 
 
   Yes, we will update the site as we release 0.5.1. It's on my and Sudha's plate.


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368242273
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
 
 Review comment:
   Could we avoid doing the insert again? Can we not reuse the records from the insert/update done above?


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248561
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   So, if I do the same with the initial insert batch, then all records will be deleted. But I don't want to disrupt the flow for the rest of the quickstart.


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248588
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   I can move it to be the last section. Hope that's fine.


[GitHub] [incubator-hudi] nsivabalan commented on issue #1248: [Minor] Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1248: [Minor] Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-577190257
 
 
   @bhasudha: can you check if things are looking good now?


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248651
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   Let's have it after the incremental query. Deletes will conclude the flow of writing and reading nicely.


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248364
 
 

 ##########
 File path: docs/quickstart.md
 ##########
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Let's first generate a new batch of inserts and then delete that same batch. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from hudi_ro_table").show()
+
+// Replace the rider value in the query below with a value from the output above. "rider-213" is from the first batch and "rider-284" is from the second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-284'")
+
+// issue deletes
 
 Review comment:
   I am deleting an entire batch of inserts, and hence thought I would do a new batch of inserts and delete that entire batch.


[GitHub] [incubator-hudi] nsivabalan commented on issue #1248: Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-575927177
 
 
   @bhasudha @vinothchandar: while I am at it, do you think we can fix the Spark setup instructions in the quickstart? They currently read: "Hudi works with Spark-2.x versions. You can follow instructions here for setting up spark. From the extracted directory run spark-shell with Hudi as:"
   
   The fix would add a note that the local Spark version and the Spark version passed in `--packages` should match.


[GitHub] [incubator-hudi] bhasudha merged pull request #1248: [Minor] Adding delete docs to QuickStart

Posted by GitBox <gi...@apache.org>.
bhasudha merged pull request #1248: [Minor] Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248
 
 
   
