Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/06 13:37:25 UTC

[GitHub] [hudi] loukey-lj opened a new pull request, #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

loukey-lj opened a new pull request, #6612:
URL: https://github.com/apache/hudi/pull/6612

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] loukey-lj commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1242881701

   @nsivabalan @danny0405 Thanks for the review. I updated the comment.




[GitHub] [hudi] nsivabalan commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1299519063

   @loukey-lj : can you respond to @guanziyue 's comment above. I will review this patch by this week. 




[GitHub] [hudi] loukey-lj commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
loukey-lj commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1060261800


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+<!--  Licensed to the Apache Software Foundation (ASF) under one or more  contributor license agreements. See the NOTICE file distributed with  this work for additional information regarding copyright ownership.  The ASF licenses this file to You under the Apache License, Version 2.0  (the "License"); you may not use this file except in compliance with  the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software  distributed under the License is distributed on an "AS IS" BASIS,  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and  limitations under the License.  -->
+
+# RFC-58: A more effective HoodieMergeHandler for COW table with parquet
+
+## Proposers
+
+*   @loukey-lj
+
+## Approvers
+
+*   @<approver1 github username>
+
+*   @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4790
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+This RFC aims to provide a more effective HoodieMergeHandler for COW tables with parquet. Hudi rewrites the whole parquet file on every COW update, which costs a lot in de/serialization and de/compression. To reduce this cost, a 'surgery' is introduced that rebuilds a new parquet file from the old one, copying unchanged row groups as they are and rewriting only the changed row groups.
+
+## Background
+
+*   Parquet is a columnar storage format. All data is horizontally partitioned into row groups. A row group contains the column chunks of all columns for the rows in that row group. A column chunk is composed of pages, which are the units of compression and encoding.
+
+*   In the current version of Hudi, every upsert of long-tail data on a COW table triggers full de/serialization and de/compression, which incurs a large CPU/IO cost.
+
+*   This RFC aims to reduce the cost of de/serialization and de/compression during upserts. If we know which row groups need to be updated, and furthermore which columns need to be updated in those row groups, we can skip de/serialization and de/compression for much of the data, which should bring a large improvement.

Review Comment:
   It has nothing to do with which payload is used. What matters is knowing which columns need to be updated and which do not. If we know which columns need to be updated, a partial update is possible even when OverwriteWithLatestAvroPayload is used. Copying row groups is applicable to all payloads. My current scenario is based on MERGE INTO: the updated columns come from parsing the SQL statement and are then set in the conf.





[GitHub] [hudi] loukey-lj commented on pull request #6612: [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet

Posted by "loukey-lj (via GitHub)" <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1563733359

   > Is this RFC only valid for SQL update scenarios, because it can parse out which columns have been updated from SQL statement. But in other scenarios, such as the "mysql -> debezium -> kafka -> hudi" scenario, we have no way of knowing which columns are updated unless additional calculations are spent, so it can't be applied immediately, right?
   
   This applies not only to partial field update scenarios, but also to entire row updates.




[GitHub] [hudi] loukey-lj commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1352637903

   > 
   
   I don't know if I can fully support schema evolution. I hope to improve this feature with the help of the community. I will write a small demo as soon as possible.




[GitHub] [hudi] vinothchandar commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.
vinothchandar commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1480372894

   @loukey-lj still interested in driving this? Its a great idea. 




[GitHub] [hudi] guanziyue commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
guanziyue commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1352617591

   > > @loukey-lj : can you respond to @guanziyue 's comment above. I will review this patch by this week.
   > 
   > Yes, this optimization is applicable to other frameworks. For Hudi, its advantage is that it can get the row groups and store them in the index while updating the index. For schema evolution, we currently only support adding fields. Different row groups in the parquet file can have different schemas, but this is unknown to the query side. If schema changes are not considered, I can submit a small demo.
   
   Thanks for your reply. Agree that this idea can improve performance a lot theoretically. It worries me that current parquet implementation or interface cannot fully support this idea. Looking forward to this RFC!




[GitHub] [hudi] loukey-lj commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1352505945

   > @loukey-lj : can you respond to @guanziyue 's comment above. I will review this patch by this week.
   
   Yes, this optimization is applicable to other frameworks. For Hudi, its advantage is that it can get the row groups and store them in the index while updating the index. For schema evolution, we currently only support adding fields. Different row groups in the parquet file can have different schemas, but this is unknown to the query side. If schema changes are not considered, I can submit a small demo.




[GitHub] [hudi] guanziyue commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
guanziyue commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1244835989

   Hi loukey-lj, excited to hear such a fantastic idea.
   May I know if you have tried out part of this idea? For example, updating a parquet file is not actually tied to the Hudi framework. We could have a unit test that directly rewrites a file using only the parquet API. As far as I know, a parquet file requires the schema to be the same across all row groups. Do we have a mechanism to handle the case where the row group we write in the latest commit has an evolved or devolved schema?




[GitHub] [hudi] danny0405 commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
danny0405 commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238824801

   Overall an interesting idea, let's put the details in the document.




[GitHub] [hudi] hudi-bot commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238794683

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "a990d7b411e5692568e548f4b31394f1fd051e77",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a990d7b411e5692568e548f4b31394f1fd051e77",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a990d7b411e5692568e548f4b31394f1fd051e77 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] guanziyue commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
guanziyue commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1363485277

   > From this class, maybe you can have a general understanding of the parquet partial update implementation https://github.com/loukey-lj/hudi/tree/partial-update hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodiePartialUpdateHandle.java
   
   Wow! This code shows your idea clearly. Thanks for the clarification. I found that parquet's internal API is used in this code. I believe the schema evolution problem I mentioned can be resolved this way. Looking forward to this RFC!




[GitHub] [hudi] loukey-lj commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by "loukey-lj (via GitHub)" <gi...@apache.org>.
loukey-lj commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1174561359


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+This RFC aims to provide a more effective HoodieMergeHandler for COW tables with parquet. Hudi rewrites the whole parquet file on every COW update, which costs a lot in de/serialization and de/compression. To reduce this cost, a 'surgery' is introduced that rebuilds a new parquet file from the old one, copying unchanged row groups as they are and rewriting only the changed row groups.

Review Comment:
   a) If a column is not updated, its pages do not need to be decompressed; if the data in a page is updated, the page needs to be deserialized and read out record by record.
   b) Our row group size is 30M; if the parquet file has only one row group, it will not benefit from row group skipping.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1060262680


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+*   This RFC aims to reduce the cost of de/serialization and de/compression during upserts. If we know which row groups need to be updated, and furthermore which columns need to be updated in those row groups, we can skip de/serialization and de/compression for much of the data, which should bring a large improvement.

Review Comment:
   I get it. My point was: in the case of OverwriteWithLatestAvroPayload, the new record is going to contain every column, and unless we read the old record from disk and deserialize it, we never know which column is being updated. In fact, we have an optimization here wherein we don't even deserialize the old record from storage in the case of OverwriteWithLatestAvroPayload, because we are going to override the entire record anyway.





[GitHub] [hudi] yihua commented on pull request #6612: [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1555117524

   @loukey-lj I updated the RFC number for you.




[GitHub] [hudi] loukey-lj commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by "loukey-lj (via GitHub)" <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1495609247

   > @loukey-lj still interested in driving this? Its a great idea.
   
   Of course. Hopefully the community will merge this RFC first.




[GitHub] [hudi] yihua merged pull request #6612: [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua merged PR #6612:
URL: https://github.com/apache/hudi/pull/6612




[GitHub] [hudi] waitingF commented on pull request #6612: [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet

Posted by "waitingF (via GitHub)" <gi...@apache.org>.
waitingF commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1560545364

   Is this RFC only valid for SQL update scenarios, because it can parse out which columns have been updated from SQL statement. But in other scenarios, such as the "mysql -> debezium -> kafka -> hudi" scenario, we have no way of knowing which columns are updated unless additional calculations are spent, so it can't be applied immediately, right?




[GitHub] [hudi] nsivabalan commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1060205868


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+## Implementation
+
+### How partial update works
+
+a. Add one more member variable (`Integer rowGroupNum`) to the class HoodieRecordLocation.
+
+<pre lang="java">public class HoodieRecordLocation implements Serializable {
+  protected String instantTime;
+  protected String fileId;
+  /**
+   * The number of the parquet row group that contains the record key.
+   */
+  protected Integer rowGroupNum;
+}</pre>
+
+b. Row groups in a Parquet file are numbered from 0; the number increases each time the buffered block size reaches `hoodie.parquet.block.size`. Since every record in a Parquet file belongs to exactly one row group, we can use the Parquet API to find the row group number of each new record as it is written to its file, and store that number in the `HoodieRecordLocation` of the `HoodieRecord`. The `HoodieRecordLocation`s are collected into the `WriteStatus`, which is then written to the index in batch.
+
+c. During the index tagging phase, the row group numbers are looked up so that they can be used to speed up file updates.
+
+The concrete upsert flow is shown below:
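
Point b can be illustrated with a toy model of the writer. This is only a sketch under stated assumptions: `RowGroupNumberAssigner`, its `assign` method, and the per-record size estimates are hypothetical names introduced here, not Hudi or Parquet APIs, and a real Parquet writer may cut row groups at slightly different points than this exact threshold check.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical) of tagging each written record with its row group number. */
class RowGroupNumberAssigner {

    /**
     * Returns the row group number of every record, in write order.
     * recordSizes holds an estimated size in bytes per record (an assumption of this sketch);
     * blockSize plays the role of hoodie.parquet.block.size.
     */
    static List<Integer> assign(long[] recordSizes, long blockSize) {
        List<Integer> rowGroupNums = new ArrayList<>();
        int currentRowGroup = 0;  // row group numbers start from 0
        long bufferedBytes = 0;   // bytes buffered in the row group that is still open
        for (long size : recordSizes) {
            rowGroupNums.add(currentRowGroup);
            bufferedBytes += size;
            if (bufferedBytes >= blockSize) {
                // The open row group is full: close it and start the next one.
                currentRowGroup++;
                bufferedBytes = 0;
            }
        }
        return rowGroupNums;
    }
}
```

In this model, the returned numbers are what would be stored in `HoodieRecordLocation.rowGroupNum` and later returned by the index.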
+
+![4.jpg](4.jpg)
+
+### Steps of writing a Parquet file on COW
+
+1.  Data preparation (upsert)
+
+    During the index tagging phase, look up `HoodieRecord.currentLocation.rowGroupNum` for each incoming record. If the row group number is empty, the record does not exist yet and the operation is an INSERT; otherwise it is a DELETE or an UPDATE. Next, the incoming records are grouped by row group number to collect all row groups that need to be updated.
+
+![1.jpg](1.jpg)
+
+2.  Row group update
+
+    Updating a row group is divided into 5 steps.
+
+    1.  Deserialize and decompress the columns that need to be combined, and assemble them into a `List<Pair<rowKey, Pair<offset, record>>>` structure, where offset is the record's row number within the row group (row numbers in every row group start at zero).
+
+    2.  Use HoodieRecordPayload#getInsertValue to deserialize the upserting data, then invoke HoodieRecordPayload#combineAndGetUpdateValue to combine the updating rows.
+
+    3.  Convert the combined data into a columnar structure, for example `[{"name":"zs","age":10},{"name":"ls","age":20}] ==> {"name":["zs","ls"],"age":[10,20]}`
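
The data preparation step and the row-to-column conversion quoted above can be sketched in plain Java. This is a minimal illustration, not the RFC's implementation: `RowGroupUpsertSketch`, `TaggedRecord`, `groupByRowGroup`, and `toColumns` are hypothetical names, rows are modelled as simple maps rather than Avro records, and every row is assumed to carry the same fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RowGroupUpsertSketch {

    /** Hypothetical stand-in for a tagged HoodieRecord: a key plus the row group number from the index. */
    static final class TaggedRecord {
        final String key;
        final Integer rowGroupNum; // null means the key was not found in the index

        TaggedRecord(String key, Integer rowGroupNum) {
            this.key = key;
            this.rowGroupNum = rowGroupNum;
        }
    }

    /** Data preparation: split off inserts and group the remaining keys by row group number. */
    static Map<Integer, List<String>> groupByRowGroup(List<TaggedRecord> records, List<String> insertsOut) {
        Map<Integer, List<String>> byRowGroup = new TreeMap<>();
        for (TaggedRecord r : records) {
            if (r.rowGroupNum == null) {
                insertsOut.add(r.key); // no known location: treated as an INSERT
            } else {
                byRowGroup.computeIfAbsent(r.rowGroupNum, k -> new ArrayList<>()).add(r.key);
            }
        }
        return byRowGroup;
    }

    /** Row-to-column conversion: one list of values per column, in row order. */
    static Map<String, List<Object>> toColumns(List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            // Assumes every row has the same fields; otherwise the columns would be misaligned.
            for (Map.Entry<String, Object> field : row.entrySet()) {
                columns.computeIfAbsent(field.getKey(), k -> new ArrayList<>()).add(field.getValue());
            }
        }
        return columns;
    }
}
```

Only the row groups that appear as keys of the grouped map would need to be decoded and rewritten; all other row groups could be copied unchanged.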

Review Comment:
   I would assume that, in the implementation, we will decide whether to take this path depending on the payload implementation used. We don't want to incur additional overhead for payloads where it may not be effective (e.g. OverwriteWithLatestAvroPayload, DefaultHoodieRecordPayload).



##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+<!--  Licensed to the Apache Software Foundation (ASF) under one or more  contributor license agreements. See the NOTICE file distributed with  this work for additional information regarding copyright ownership.  The ASF licenses this file to You under the Apache License, Version 2.0  (the "License"); you may not use this file except in compliance with  the License. You may obtain a copy of the License at
+
+<pre> http://www.apache.org/licenses/LICENSE-2.0</pre>
+
+Unless required by applicable law or agreed to in writing, software  distributed under the License is distributed on an "AS IS" BASIS,  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and  limitations under the License.  -->
+
+# RFC-58: A more effective HoodieMergeHandler for COW table with parquet
+
+## Proposers
+
+*   @loukey-lj
+
+## Approvers
+
+*   @<approver1 github username>
+
+*   @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4790
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+This RFC proposes a more efficient HoodieMergeHandler for COW tables stored as Parquet. Today Hudi rewrites the whole Parquet file on every COW update, which is costly in de/serialization and de/compression. To reduce this cost, the RFC introduces a 'surgery' that builds a new Parquet file from the old one by copying unchanged row groups as-is and rewriting only the changed row groups.
+
+## Background
+
+*   Parquet is a columnar storage format. Data is divided horizontally into row groups. A row group contains one column chunk per column for the rows it covers. A column chunk is composed of pages, which are the units of compression and encoding.
+
+*   In the current version of Hudi, every upsert of long-tail data on a COW table triggers a full de/serialization and de/compression of the affected file, which incurs a large CPU and I/O cost.
+
+*   This RFC aims to reduce the de/serialization and de/compression cost of upserts. If we know which row groups need to be updated, and furthermore which columns within those row groups, we can skip de/serialization and de/compression for most of the data, which should bring a significant improvement.

Review Comment:
   So, this would be effective only in case of partial updates? In other words, for the most commonly used payloads like OverwriteWithLatestAvroPayload, DefaultHoodieRecordPayload etc., this might cause unnecessary overhead, right?
   





[GitHub] [hudi] nsivabalan commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238786159

   can you please fill in PR description and template. 




[GitHub] [hudi] hudi-bot commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238934239

   ## CI report:
   
   * a990d7b411e5692568e548f4b31394f1fd051e77 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11201) 
   




Re: [PR] [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet [hudi]

Posted by "waitingF (via GitHub)" <gi...@apache.org>.
waitingF commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1870146637

   @loukey-lj @yihua hi, any progress on this improvement? Really looking forward to this.




[GitHub] [hudi] loukey-lj commented on pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
loukey-lj commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1362850975

   From this class, maybe you can have a general understanding of the parquet partial update implementation
   https://github.com/loukey-lj/hudi/tree/partial-update
   hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodiePartialUpdateHandle.java




[GitHub] [hudi] nsivabalan commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1060262873


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+*   This RFC aims to reduce the de/serialization and de/compression cost of upserts. If we know which row groups need to be updated, and furthermore which columns within those row groups, we can skip de/serialization and de/compression for most of the data, which should bring a significant improvement.

Review Comment:
   Yeah, SQL MERGE INTO uses ExpressionPayload and hence I definitely see a real benefit. But for other payloads, it's very much implementation dependent, as I have explained above.
   





[GitHub] [hudi] hudi-bot commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238797328

   ## CI report:
   
   * a990d7b411e5692568e548f4b31394f1fd051e77 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11201) 
   




[GitHub] [hudi] vinothchandar commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.
vinothchandar commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1164843499


##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+This RFC proposes a more efficient HoodieMergeHandler for COW tables stored as Parquet. Today Hudi rewrites the whole Parquet file on every COW update, which is costly in de/serialization and de/compression. To reduce this cost, the RFC introduces a 'surgery' that builds a new Parquet file from the old one by copying unchanged row groups as-is and rewriting only the changed row groups.

Review Comment:
   Two questions:
   
   a) is there a way to copy over unchanged columns as well within each row group? or do this at the page level?  
   
   b) IIUC I think this helps in cases where the parquet file has multiple row groups and only few of them are changed? would you expect to see any performance improvements with the default 120 MB file size, with 120MB block size? i.e with just one row group in the parquet file





[GitHub] [hudi] yihua commented on pull request #6612: [HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1569196465

   Hi @loukey-lj thanks for putting up the RFC and the great ideas on improving the write performance in Hudi!  I'll merge this RFC now.

