
Posted to commits@hudi.apache.org by "nsivabalan (via GitHub)" <gi...@apache.org> on 2023/03/07 20:32:56 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #8115: [HUDI-5864] Adding file system view refresh regression to our release page

nsivabalan opened a new pull request, #8115:
URL: https://github.com/apache/hudi/pull/8115

   ### Change Logs
   
   Adding file system view refresh regression to our release page
   
   ### Impact
   
   Caution users about the regression. 
   
   ### Risk level (write none, low medium or high below)
   
   low. 
   
   ### Documentation Update
   
   This patch updates our release pages with regressions detected in some of the Hudi releases.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] danny0405 commented on a diff in pull request #8115: [HUDI-5864] Adding file system view refresh regression to our release page

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 commented on code in PR #8115:
URL: https://github.com/apache/hudi/pull/8115#discussion_r1128963844


##########
website/releases/release-0.11.0.md:
##########
@@ -201,6 +201,17 @@ detailed settings.
 In 0.11.0, `org.apache.hudi.utilities.schema.HiveSchemaProvider` is added for getting schema from user-defined hive
 tables. This is useful when tailing Hive tables in `HoodieDeltaStreamer` instead of having to provide avro schema files.
 
+## Known Regression
+
+In 0.11.0 release, with the newly added support for Spark SQL features, the following performance regressions were
+inadvertently introduced:
+* Partition pruning for some of the COW tables is not applied properly
+* Spark SQL query caching (which caches parsed and resolved queries) was not working correctly, resulting in additional
+  overhead to re-analyze the query every time it is executed (listing the table contents, etc.)
+

Review Comment:
   Release 0.11.0 also has a critical data-loss regression, which https://github.com/apache/hudi/pull/6179 fixed. We need to mention it here.




[GitHub] [hudi] danny0405 commented on a diff in pull request #8115: [HUDI-5864] Adding file system view refresh regression to our release page

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 commented on code in PR #8115:
URL: https://github.com/apache/hudi/pull/8115#discussion_r1130357173


##########
website/releases/release-0.11.0.md:
##########
@@ -201,6 +201,17 @@ detailed settings.
 In 0.11.0, `org.apache.hudi.utilities.schema.HiveSchemaProvider` is added for getting schema from user-defined hive
 tables. This is useful when tailing Hive tables in `HoodieDeltaStreamer` instead of having to provide avro schema files.
 
+## Known Regression
+
+In 0.11.0 release, with the newly added support for Spark SQL features, the following performance regressions were
+inadvertently introduced:
+* Partition pruning for some of the COW tables is not applied properly
+* Spark SQL query caching (which caches parsed and resolved queries) was not working correctly, resulting in additional
+  overhead to re-analyze the query every time it is executed (listing the table contents, etc.)
+

Review Comment:
   Yeah, I will open a PR soon.




[GitHub] [hudi] nsivabalan merged pull request #8115: [HUDI-5864] Adding file system view refresh regression to our release page

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan merged PR #8115:
URL: https://github.com/apache/hudi/pull/8115



[GitHub] [hudi] yihua commented on a diff in pull request #8115: [HUDI-5864] Adding file system view refresh regression to our release page

Posted by "yihua (via GitHub)" <gi...@apache.org>.

yihua commented on code in PR #8115:
URL: https://github.com/apache/hudi/pull/8115#discussion_r1130013648


##########
website/releases/release-0.11.0.md:
##########
@@ -201,6 +201,17 @@ detailed settings.
 In 0.11.0, `org.apache.hudi.utilities.schema.HiveSchemaProvider` is added for getting schema from user-defined hive
 tables. This is useful when tailing Hive tables in `HoodieDeltaStreamer` instead of having to provide avro schema files.
 
+## Known Regression
+
+In 0.11.0 release, with the newly added support for Spark SQL features, the following performance regressions were
+inadvertently introduced:
+* Partition pruning for some of the COW tables is not applied properly
+* Spark SQL query caching (which caches parsed and resolved queries) was not working correctly, resulting in additional
+  overhead to re-analyze the query every time it is executed (listing the table contents, etc.)
+

Review Comment:
   @danny0405 could you put up a PR to add the information to the release notes?




[GitHub] [hudi] danny0405 commented on a diff in pull request #8115: [HUDI-5864] Adding file system view refresh regression to our release page

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.

danny0405 commented on code in PR #8115:
URL: https://github.com/apache/hudi/pull/8115#discussion_r1128958955


##########
website/releases/release-0.12.0.md:
##########
@@ -200,6 +200,29 @@ getting duplicate records in your pipeline:
 - Making sure that the [fix](https://github.com/apache/hudi/pull/6883) is
   included in your custom artifacts (if you're building and using ones)
 
+
+We also found another regression related to metadata table and timeline server interplay with streaming ingestion pipelines.
+
+The FileSystemView that Hudi maintains internally could go out of sync due to occasional race conditions when table services are involved
+(compaction, clustering), which could route updates and deletes to older file versions and result in missed updates and deletes.
+
+Here are the user flows that could potentially be impacted by this.
+
+- This impacts pipelines using Deltastreamer in **continuous mode** (sync once is not impacted), Spark streaming, or if you have been directly
+  using write client across batches/commits instead of the standard ways to write to Hudi. In other words, batch writes should not be impacted.
+- Among these write models, this could have an impact only when table services are enabled.
+    - COW: clustering enabled (inline or async)
+    - MOR: compaction enabled (by default, inline or async)
+- Also, the impact is applicable only when metadata table is enabled, and timeline server is enabled (which are defaults as of 0.12.0)
+
+Based on some production data, we expect this issue to cause roughly less than 1% of updates to be missed, since it's a race condition
+and table services are generally scheduled once every N commits. The percentage of missed updates could be even lower if
+table services run less frequently.
+
+[Here](https://issues.apache.org/jira/browse/HUDI-5863) is the Jira issue for this regression, and the fix has already landed in master.
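
The impact conditions listed in the release-note text above can be read as a conjunction of three checks. Here is a minimal Python sketch of that reading, useful as a quick self-check for a pipeline. The class, field names, and mode strings are hypothetical and chosen for illustration; they are not Hudi configuration keys or APIs.

```python
from dataclasses import dataclass

# Hypothetical labels for the write models named in the release note.
LONG_RUNNING_MODES = {"continuous", "spark_streaming", "direct_write_client"}


@dataclass(frozen=True)
class PipelineSetup:
    """Illustrative description of a pipeline; not a Hudi class."""
    write_mode: str                # one of LONG_RUNNING_MODES, or "batch"
    table_type: str                # "COW" or "MOR"
    clustering_enabled: bool       # inline or async clustering
    compaction_enabled: bool       # inline or async compaction
    metadata_table_enabled: bool
    timeline_server_enabled: bool


def potentially_affected(setup: PipelineSetup) -> bool:
    """Return True when every condition from the release note holds."""
    if setup.table_type == "COW":
        table_service_on = setup.clustering_enabled
    elif setup.table_type == "MOR":
        table_service_on = setup.compaction_enabled
    else:
        raise ValueError(f"unknown table type: {setup.table_type!r}")

    return (
        setup.write_mode in LONG_RUNNING_MODES
        and table_service_on
        and setup.metadata_table_enabled
        and setup.timeline_server_enabled
    )
```

A batch writer, or a pipeline with the metadata table or the timeline server disabled, falls outside the affected set under this reading.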

Review Comment:
   Recently, we found another critical regression in Flink metadata sync: https://github.com/apache/hudi/pull/8050. It causes an object reference leak and risks OOM for long-running streaming jobs. Can we also address it for the 0.12.x and 0.13.x releases?
   
   The job seems to crash after running continuously for about 2 weeks.


