
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/05 02:32:31 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request, #5497: [WIP] Avoid calling `getDataSize` after every record written

alexeykudinkin opened a new pull request, #5497:
URL: https://github.com/apache/hudi/pull/5497

   ## What is the purpose of the pull request
   
   `getDataSize` has non-trivial overhead in the current `ParquetWriter` implementation: every call traverses the column groups already composed in memory. Instead, we can sample the calls to `getDataSize` to amortize its cost.
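
The sampling idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `SampledSizeCheck`, `checkInterval`, `sizeProbe` and `simulate` are hypothetical names, and the expensive size computation is stood in for by a `LongSupplier`.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: consult an expensive size probe only once every N records. */
public class SampledSizeCheck {
    private final int checkInterval;      // records between two probe calls (assumed knob)
    private final long maxSize;           // byte limit after which writing should stop
    private final LongSupplier sizeProbe; // stand-in for an expensive call such as getDataSize()
    private long writtenRecords = 0;
    private long cachedSize = -1;         // -1 means "never probed"
    private long probeCalls = 0;

    public SampledSizeCheck(int checkInterval, long maxSize, LongSupplier sizeProbe) {
        this.checkInterval = checkInterval;
        this.maxSize = maxSize;
        this.sizeProbe = sizeProbe;
    }

    /** Refreshes the cached size on the first call and at every interval boundary. */
    public boolean canWrite() {
        if (cachedSize == -1 || writtenRecords % checkInterval == 0) {
            cachedSize = sizeProbe.getAsLong();
            probeCalls++;
        }
        return cachedSize < maxSize;
    }

    public void recordWritten() {
        writtenRecords++;
    }

    public long getWrittenRecords() {
        return writtenRecords;
    }

    public long getProbeCalls() {
        return probeCalls;
    }

    /** Attempts `attempts` writes of `recordBytes` each; returns {records written, probe calls}. */
    public static long[] simulate(int attempts, int checkInterval, long maxSize, long recordBytes) {
        long[] bytes = {0}; // mutable holder so the lambda can read the running total
        SampledSizeCheck check = new SampledSizeCheck(checkInterval, maxSize, () -> bytes[0]);
        for (int i = 0; i < attempts; i++) {
            if (check.canWrite()) {
                bytes[0] += recordBytes;
                check.recordWritten();
            }
        }
        return new long[] {check.getWrittenRecords(), check.getProbeCalls()};
    }

    public static void main(String[] args) {
        long[] result = simulate(5000, 1000, Long.MAX_VALUE, 100);
        System.out.println(result[0] + " records written, " + result[1] + " size probes");
    }
}
```

The trade-off in a scheme like this is precision: between two probes the writer relies on a stale size, so the limit can be overshot by up to one interval's worth of records before the next check notices.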
   
   ## Brief change log
   
    - Sample the checks of the currently written output size to avoid excessive block traversals
    - Extracted `HoodieBaseParquetWriter`, encapsulating functionality shared between the `ParquetWriter` implementations
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] vinothchandar commented on a diff in pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on code in PR #5497:
URL: https://github.com/apache/hudi/pull/5497#discussion_r929388938


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java:
##########
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.api.WriteSupport;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/**
+ * Base class of Hudi's custom {@link ParquetWriter} implementations
+ *
+ * @param <R> target type of the object being written into Parquet files (for ex,
+ *           {@code IndexedRecord}, {@code InternalRow})
+ */
+public abstract class HoodieBaseParquetWriter<R> extends ParquetWriter<R> {
+
+  private static final int WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK = 1000;
+
+  private final AtomicLong writtenRecordCount = new AtomicLong(0);
+  private final long maxFileSize;
+  private long lastCachedDataSize = -1;
+
+  public HoodieBaseParquetWriter(Path file,
+                                 HoodieBaseParquetConfig<? extends WriteSupport<R>> parquetConfig) throws IOException {
+    super(HoodieWrapperFileSystem.convertToHoodiePath(file, parquetConfig.getHadoopConf()),
+        ParquetFileWriter.Mode.CREATE,
+        parquetConfig.getWriteSupport(),
+        parquetConfig.getCompressionCodecName(),
+        parquetConfig.getBlockSize(),
+        parquetConfig.getPageSize(),
+        parquetConfig.getPageSize(),
+        parquetConfig.dictionaryEnabled(),
+        DEFAULT_IS_VALIDATING_ENABLED,
+        DEFAULT_WRITER_VERSION,
+        FSUtils.registerFileSystem(file, parquetConfig.getHadoopConf()));
+
+    // We cannot accurately measure the snappy compressed output file size. We are choosing a
+    // conservative 10%
+    // TODO - compute this compression ratio dynamically by looking at the bytes written to the
+    // stream and the actual file size reported by HDFS
+    this.maxFileSize = parquetConfig.getMaxFileSize()
+        + Math.round(parquetConfig.getMaxFileSize() * parquetConfig.getCompressionRatio());
+  }
+
+  public boolean canWrite() {
+    // TODO we can actually do evaluation more accurately:

Review Comment:
   +1 to the overall idea. But here is the deal: the size may not update until a row group is actually flushed out to storage, so `getDataSize()` simply returns `0` until then. This on-the-fly file sizing is only useful for large files with multiple blocks/row groups. That was the behavior back in the day.
   
   What pattern did you observe on writes to S3? Is `getDataSize()` real-time, i.e. does it reflect the last write's size between subsequent calls?
   
   I see the code here, which should respect the buffered data?
   
   ```
     /**
      * @return the total size of data written to the file and buffered in memory
      */
     public long getDataSize() {
       return lastRowGroupEndPos + columnStore.getBufferedSize();
     }
   ```
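
As a toy illustration of what the snippet above implies, the following sketch models a writer whose reported size is the flushed bytes plus whatever is still buffered. It is not parquet-mr code: `ToyBufferedWriter`, `flushThreshold` and `getFlushedSize` are made-up names, and the list-based buffer is an assumption chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not parquet-mr code): size = flushed bytes + bytes still buffered in memory. */
public class ToyBufferedWriter {
    private final long flushThreshold;                 // assumed row-group size in bytes
    private final List<Integer> buffered = new ArrayList<>();
    private long lastRowGroupEndPos = 0;               // bytes already flushed to "storage"

    public ToyBufferedWriter(long flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void write(int recordBytes) {
        buffered.add(recordBytes);
        if (getBufferedSize() >= flushThreshold) {
            lastRowGroupEndPos += getBufferedSize();   // "flush" the row group
            buffered.clear();
        }
    }

    /** Walks every buffered entry, so its cost grows with the buffer. */
    public long getBufferedSize() {
        long total = 0;
        for (int size : buffered) {
            total += size;
        }
        return total;
    }

    public long getFlushedSize() {
        return lastRowGroupEndPos;
    }

    /** Flushed bytes plus buffered bytes, mirroring the shape of the snippet above. */
    public long getDataSize() {
        return lastRowGroupEndPos + getBufferedSize();
    }
}
```

In this model the reported size is non-zero as soon as the first record is buffered, well before any flush happens; whether the real writer behaves the same way is exactly the question raised here.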
   



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java:
##########
@@ -0,0 +1,87 @@
+  public boolean canWrite() {
+    // TODO we can actually do evaluation more accurately:
+    //      if we cache last data size check, since we account for how many records

Review Comment:
   Can we file a JIRA for this follow-on work, after verifying the real-timeness of `getDataSize()`?




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

alexeykudinkin commented on code in PR #5497:
URL: https://github.com/apache/hudi/pull/5497#discussion_r929406153


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java:
##########
@@ -0,0 +1,87 @@
+  public boolean canWrite() {
+    // TODO we can actually do evaluation more accurately:

Review Comment:
   > +1 to the overall idea. but here is the deal - the size may not update until a row group is actually flushed out to storage. so getDataSize() simply returns 0 until then.
   
   It won't: it always returns an accurate metric, because
   
   1. It keeps track of how many bytes were written (`lastRowGroupEndPos`)
   2. It calculates the actual buffered footprint (`columnStore.getBufferedSize()`)
   
   The second is the problem: it always traverses all of the cached column groups to accurately calculate the in-memory footprint (and there's no internal caching). What ended up happening is that the writer kept growing the buffer for the whole file (120 MB), not flushing it until closure, which made the traversals quadratic in runtime.
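
The quadratic growth can be made concrete with a small cost model. This is a sketch under a simplifying assumption (one work unit per buffered record visited by a size check, and nothing flushed until the end); `TraversalCost` and `totalWork` are hypothetical names, not part of Hudi or Parquet.

```java
/** Hypothetical cost model: one work unit per buffered record visited by a size check. */
public class TraversalCost {

    /**
     * Total work units spent on size checks while writing `records` records,
     * when a check runs after every `checkInterval`-th record and nothing is
     * flushed in between (so a check after n records visits n buffered records).
     */
    public static long totalWork(long records, long checkInterval) {
        long work = 0;
        for (long written = 1; written <= records; written++) {
            if (written % checkInterval == 0) {
                work += written; // cost grows with everything buffered so far
            }
        }
        return work;
    }

    public static void main(String[] args) {
        long everyRecord = totalWork(10_000, 1);  // 1 + 2 + ... + 10000
        long sampled = totalWork(10_000, 1_000);  // 1000 + 2000 + ... + 10000
        System.out.println("check after every record:  " + everyRecord);
        System.out.println("check after every 1000th:  " + sampled);
    }
}
```

Under this model, checking after every one of 10,000 records costs 50,005,000 units, while checking after every 1000th costs 55,000: the work is still quadratic in the number of buffered records, but divided by roughly the sampling interval.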



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java:
##########
@@ -0,0 +1,87 @@
+  public boolean canWrite() {
+    // TODO we can actually do evaluation more accurately:
+    //      if we cache last data size check, since we account for how many records

Review Comment:
   I validated that `getDataSize` returns accurate results for buffered data




[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1122848568

   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428) 
   * 56fc702765425825d762a9367a49fcb83d2effbd UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1123091016

   ## CI report:
   
   * 750c4ea998fdc35796407f033a9e139cec1dd5f3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8558) 
   

[GitHub] [hudi] nsivabalan merged pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

nsivabalan merged PR #5497:
URL: https://github.com/apache/hudi/pull/5497



[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118113724

   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428) 
   

[GitHub] [hudi] hudi-bot commented on pull request #5497: [WIP] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118109102

   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c UNKNOWN
   

[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1123062729

   ## CI report:
   
   * 56fc702765425825d762a9367a49fcb83d2effbd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8554) 
   * 750c4ea998fdc35796407f033a9e139cec1dd5f3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8558) 
   

[GitHub] [hudi] nsivabalan commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1122829331

   Pushed out a commit to address my feedback.



[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1122851250

   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428) 
   * 56fc702765425825d762a9367a49fcb83d2effbd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8554) 
   

[GitHub] [hudi] nsivabalan commented on a diff in pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on code in PR #5497:
URL: https://github.com/apache/hudi/pull/5497#discussion_r869637656


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java:
##########
@@ -0,0 +1,86 @@
+/**
+ * Base class of Hudi's custom {@link ParquetWriter} implementations
+ *
+ * @param <R> target type of the object being written into Parquet files (for ex,
+ *           {@code IndexedRecord}, {@code InternalRow})
+ */
+public abstract class HoodieBaseParquetWriter<R> extends ParquetWriter<R> {
+
+  private static final int WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK = 1000;
+
+  private final AtomicLong writtenRecordCount = new AtomicLong(1);
+
+  private final long maxFileSize;
+
+  private long lastCachedDataSize = -1;

Review Comment:
   not used.




[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1123060159

   ## CI report:
   
   * 56fc702765425825d762a9367a49fcb83d2effbd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8554) 
   * 750c4ea998fdc35796407f033a9e139cec1dd5f3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] nsivabalan commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1123664377

   CI succeeded https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8558&view=results
   



[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1122856555

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b3a35b349472047355f36405abf442434e73f66c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428",
       "triggerID" : "b3a35b349472047355f36405abf442434e73f66c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "56fc702765425825d762a9367a49fcb83d2effbd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8554",
       "triggerID" : "56fc702765425825d762a9367a49fcb83d2effbd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 56fc702765425825d762a9367a49fcb83d2effbd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8554) 
   



[GitHub] [hudi] hudi-bot commented on pull request #5497: [WIP] Avoid calling `getDataSize` after every record written

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118110328

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "b3a35b349472047355f36405abf442434e73f66c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428",
       "triggerID" : "b3a35b349472047355f36405abf442434e73f66c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428) 
   

