Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/19 11:57:07 UTC

[GitHub] [incubator-hudi] bvaradar opened a new pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

bvaradar opened a new pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253
 
 
   
   ## What is the purpose of the pull request
   
   * Introduce ability to compress bloom filters while storing in parquet
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592649734
 
 
   Hi @vinothchandar
   
   > we can only place strings inside the parquet footers
   
   Right, I know. 
   
   The `byte[]` -> `base64 string` -> `byte[]` round trip adds unnecessary steps:
   ![image](https://user-images.githubusercontent.com/20113411/75572628-1b6ddb00-5a96-11ea-8e8d-e66cd3883db8.png)
   
   ### What I want to say is
   ![image](https://user-images.githubusercontent.com/20113411/75572733-4e17d380-5a96-11ea-82c8-593e2083507d.png)
   
   
   
   
   
   
   


[GitHub] [incubator-hudi] nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592861991
 
 
   Good point @lamber-ken on avoiding conversions. 


[GitHub] [incubator-hudi] lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592880451
 
 
   Also, the compressed `byte[]` data seems to be bigger than the original.
   ```
   test random keys
   Data origin byte[] length Stage 0 3594410
   Data compress byte[] length Stage 1 3630217
   
   test sequential keys
   Data origin byte[] length Stage 0 3594410
   Data compress byte[] length Stage 1 3630276
   ```
   
   Added these log statements.
   ![image](https://user-images.githubusercontent.com/20113411/75601357-33287c00-5af5-11ea-8e3e-c270a734eadc.png)
   
   ![image](https://user-images.githubusercontent.com/20113411/75601365-41769800-5af5-11ea-8baa-e345a075b8be.png)
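   A possible explanation for the lack of savings, offered as an assumption rather than something measured in this thread: a Bloom filter sized for its entry count ends up with roughly half of its bits set, so its bit array looks close to random and leaves gzip very little redundancy to remove. The sketch below uses JDK classes only (the class and method names are illustrative, not part of Hudi) and compares gzip on random bytes with gzip on a highly repetitive buffer of the same size.

   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.util.Random;
   import java.util.zip.GZIPOutputStream;

   // Illustrative only: not part of the PR or of Hudi.
   final class GzipEntropySketch {

     // Gzip-compress a byte array in memory and return the compressed bytes.
     static byte[] gzip(byte[] raw) throws IOException {
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
         gz.write(raw);
       }
       return bos.toByteArray();
     }

     // Random bytes stand in for a well-filled Bloom filter bit array (assumption).
     static byte[] randomBytes(int n, long seed) {
       byte[] out = new byte[n];
       new Random(seed).nextBytes(out);
       return out;
     }

     public static void main(String[] args) throws IOException {
       byte[] random = randomBytes(1 << 20, 42L);
       byte[] zeros = new byte[1 << 20];
       System.out.println("random: " + random.length + " -> " + gzip(random).length);
       System.out.println("zeros:  " + zeros.length + " -> " + gzip(zeros).length);
     }
   }
   ```

   If the assumption holds, the slight growth in the log output above would simply be gzip framing overhead on data that is already close to incompressible.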
   
   
   
   


[GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593571838
 
 
   Marking as WIP for now. 


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384915133
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestParquetUtils.java
 ##########
 @@ -132,7 +132,7 @@ private void writeParquetFile(String filePath, List<String> rowKeys) throws Exce
     BloomFilter filter = BloomFilterFactory
         .createBloomFilter(1000, 0.0001, 10000, bloomFilterTypeToTest);
     HoodieAvroWriteSupport writeSupport =
-        new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter);
+        new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter, false);
 
 Review comment:
   Added


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368711970
 
 

 ##########
 File path: hudi-cli/src/main/scala/org/apache/hudi/cli/SparkHelpers.scala
 ##########
 @@ -43,7 +43,7 @@ object SparkHelpers {
     val schema: Schema = sourceRecords.get(0).getSchema
     val filter: BloomFilter = BloomFilterFactory.createBloomFilter(HoodieIndexConfig.DEFAULT_BLOOM_FILTER_NUM_ENTRIES.toInt, HoodieIndexConfig.DEFAULT_BLOOM_FILTER_FPP.toDouble,
       HoodieIndexConfig.DEFAULT_HOODIE_BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.toInt, HoodieIndexConfig.DEFAULT_BLOOM_INDEX_FILTER_TYPE);
-    val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter)
+    val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter, java.lang.Boolean.valueOf(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION))
 
 Review comment:
   replace Boolean.valueOf() with Boolean.parseBoolean().
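   A minimal sketch of the difference, using a hypothetical property key and helper name (neither is from the PR): `Boolean.valueOf(String)` returns a boxed `Boolean` that then has to be unboxed, whereas `Boolean.parseBoolean(String)` returns the primitive directly. Both treat `null` and anything other than a case-insensitive "true" as `false`.

   ```java
   import java.util.Properties;

   // Illustrative only: the key and the helper are assumptions, not Hudi APIs.
   final class BooleanParsingSketch {

     // Hypothetical configuration key.
     static final String KEY = "example.bloom.filter.compression.enabled";

     // Reads the flag as a primitive boolean; a missing property yields false.
     static boolean isEnabled(Properties props) {
       return Boolean.parseBoolean(props.getProperty(KEY));
     }

     public static void main(String[] args) {
       Properties props = new Properties();
       System.out.println(isEnabled(props));
       props.setProperty(KEY, "TRUE");
       System.out.println(isEnabled(props));
     }
   }
   ```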


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384915088
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestGzipCompressionUtils.java
 ##########
 @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.SimpleBloomFilter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.UUID;
+
 
 Review comment:
   As this was a very simple unit-test, I didn't see how annotations would help make the test more compact. Will leave it as such. Let me know if you strongly feel there is a need


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368711264
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -166,6 +168,11 @@ public Builder bloomIndexBucketizedChecking(boolean bucketizedChecking) {
       return this;
     }
 
+    public Builder bloomIndexEnableCompression(boolean enableCompression) {
+      props.setProperty(BLOOM_INDEX_ENABLE_COMPRESSION, String.valueOf(enableCompression));
 
 Review comment:
   just call Boolean.toString() instead of String.valueOf()


[GitHub] [incubator-hudi] vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592610993
 
 
   @lamber-ken we can only place strings inside the parquet footers


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r376785469
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestGzipCompressionUtils.java
 ##########
 @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.SimpleBloomFilter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.UUID;
+
+public class TestGzipCompressionUtils {
+
+  @Test
+  public void testCompressDeCompress() {
 
 Review comment:
   +1


[GitHub] [incubator-hudi] vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-591720460
 
 
   @bvaradar this would be nice to land, so I can use for perf testing :) 


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r372198618
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestParquetUtils.java
 ##########
 @@ -132,7 +132,7 @@ private void writeParquetFile(String filePath, List<String> rowKeys) throws Exce
     BloomFilter filter = BloomFilterFactory
         .createBloomFilter(1000, 0.0001, 10000, bloomFilterTypeToTest);
     HoodieAvroWriteSupport writeSupport =
-        new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter);
+        new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter, false);
 
 Review comment:
   don't we need to test both true and false? 


[GitHub] [incubator-hudi] lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592855749
 
 
   Hi @bvaradar, I tested the PR, and it seems that the compressed size is bigger than the original one. 
   Please correct me if I'm wrong, thanks.
   
   ```
   test random keys
   original size: 4792548
   compress size: 4967672
   
   test sequential keys
   original size: 4792548
   compress size: 4967746
   ```
   ```
   SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
   
   System.out.println("test random keys");
   for (int i = 0; i < 1000000; i++) {
     String key = UUID.randomUUID().toString();
     filter.add(key);
   }
   
   System.out.println("original size: " + filter.serializeToString().length());
   System.out.println("compress size: " + GzipCompressionUtils.compress(filter.serializeToString()).length());
   
   System.out.println("\ntest sequential keys");
   filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
   
   for (int i = 0; i < 1000000; i++) {
     String key = "key-" + i;
     filter.add(key);
   }
   
   System.out.println("original size: " + filter.serializeToString().length());
   System.out.println("compress size: " + GzipCompressionUtils.compress(filter.serializeToString()).length());
   
   ```


[GitHub] [incubator-hudi] lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592397157
 
 
   Hi @bvaradar, 
   The call sequence will be: `byte[]` -> `base64 String` -> `gzip stream` -> `base64 String`
   
   ![image](https://user-images.githubusercontent.com/20113411/75521469-cc4a8a80-5a42-11ea-8d92-59b9f845d2d6.png)
   
   IMO, we can gzip the `byte[]` data directly, like:
   ![image](https://user-images.githubusercontent.com/20113411/75521938-bb4e4900-5a43-11ea-9399-8a8eeae72692.png)
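   In text form, the suggestion can be sketched roughly as follows: gzip the raw `byte[]` first and Base64-encode only once, so the result is still a string that can go into the parquet footer. The class and method names are hypothetical and this is not the PR's `GzipCompressionUtils`; only JDK classes are used.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.util.Base64;
   import java.util.zip.GZIPInputStream;
   import java.util.zip.GZIPOutputStream;

   // Illustrative only: names are assumptions, not the PR's implementation.
   final class GzipBase64Sketch {

     // byte[] -> gzip bytes -> Base64 string (a single string conversion).
     static String encode(byte[] raw) throws IOException {
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
         gz.write(raw);
       }
       return Base64.getEncoder().encodeToString(bos.toByteArray());
     }

     // Base64 string -> gzip bytes -> original byte[].
     static byte[] decode(String encoded) throws IOException {
       byte[] compressed = Base64.getDecoder().decode(encoded);
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
         byte[] buf = new byte[8192];
         int n;
         while ((n = gz.read(buf)) != -1) {
           out.write(buf, 0, n);
         }
       }
       return out.toByteArray();
     }
   }
   ```

   The sketch uses the basic encoder from `Base64.getEncoder()`, which emits no line separators; whether that suits the footer format used here is left as an open choice.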
   
   


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384915114
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestGzipCompressionUtils.java
 ##########
 @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.SimpleBloomFilter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.UUID;
+
+public class TestGzipCompressionUtils {
+
+  @Test
+  public void testCompressDeCompress() {
 
 Review comment:
   Added


[GitHub] [incubator-hudi] vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-577011630
 
 
   @nsivabalan can you also please review this. 


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384914626
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##########
 @@ -318,6 +318,10 @@ public double getBloomFilterFPP() {
     return Double.parseDouble(props.getProperty(HoodieIndexConfig.BLOOM_FILTER_FPP));
   }
 
+  public boolean isBloomFilterCompressionEnabled() {
+    return Boolean.valueOf(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION));
 
 Review comment:
   Done.


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592397157
 
 
   Hi @bvaradar, the idea of compressing strings is great. One thing to consider: 
   
   The call sequence will be: `byte[]` -> `base64 String` -> `gzip stream` -> `base64 String`
   
   ![image](https://user-images.githubusercontent.com/20113411/75521469-cc4a8a80-5a42-11ea-8d92-59b9f845d2d6.png)
   
   IMO, we can gzip the `byte[]` data directly, like:
   
   ![image](https://user-images.githubusercontent.com/20113411/75521938-bb4e4900-5a43-11ea-9399-8a8eeae72692.png)
   
   


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r372197069
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/BloomFilterFactory.java
 ##########
 @@ -52,6 +52,7 @@ public static BloomFilter createBloomFilter(int numEntries, double errorRate, in
    * @return the {@link BloomFilter} thus generated from the passed in serialized string
    */
   public static BloomFilter fromString(String serString, String bloomFilterTypeCode) {
+
 
 Review comment:
   can we remove the additional line break?


[GitHub] [incubator-hudi] nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592868794
 
 
   I also played with testing the sizes. Looks like the encoding is the culprit. 
   
   test random keys
   Data before compress: 4792548
   Data after compress Stage 1 3630215
   Data after compress Stage 2 4967662
   
   
   // Added these log statements:
   byte[] compressed = bos.toByteArray();
   System.out.println("Data after compress Stage 1 " + compressed.length);
   Base64.Encoder encoder = Base64.getMimeEncoder();
   String toReturn = new String(encoder.encode(compressed), StandardCharsets.UTF_8);
   System.out.println("Data after compress Stage 2 " + toReturn.length());
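   The sizes reported above line up with plain Base64 arithmetic. Base64 emits 4 characters for every 3 input bytes, and the JDK MIME encoder additionally joins 76-character lines with "\r\n". The sketch below (hypothetical class and method names, JDK only) computes both lengths and cross-checks the formulas against the real encoders on a small buffer.

   ```java
   import java.util.Base64;

   // Illustrative only: helper names are assumptions, not part of Hudi.
   final class Base64OverheadSketch {

     // Length of padded, unwrapped Base64 output for n input bytes.
     static long basicLength(long n) {
       return 4 * ((n + 2) / 3);
     }

     // Length of MIME Base64 output: 76-character lines joined by "\r\n",
     // with no separator after the last line.
     static long mimeLength(long n) {
       long chars = basicLength(n);
       if (chars == 0) {
         return 0;
       }
       long lines = (chars + 75) / 76;
       return chars + 2 * (lines - 1);
     }

     public static void main(String[] args) {
       // Cross-check the formulas against the JDK encoders.
       byte[] sample = new byte[1000];
       System.out.println(Base64.getEncoder().encode(sample).length == basicLength(sample.length));
       System.out.println(Base64.getMimeEncoder().encode(sample).length == mimeLength(sample.length));

       // Sizes taken from the logs in this thread.
       System.out.println(basicLength(3594410L)); // raw filter bytes -> Base64 string
       System.out.println(mimeLength(3630215L));  // gzip bytes -> MIME Base64 string
     }
   }
   ```

   With these formulas, 3594410 raw bytes map to 4792548 Base64 characters (the "before compress" figure), and 3630215 gzip bytes map to 4967662 MIME Base64 characters (the Stage 2 figure). In other words, the final string grows because gzip does not shrink the bytes and the MIME line separators add roughly 2.6% on top of the usual 4/3 expansion.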
   
   
   
   


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592397157
 
 
   Hi @bvaradar, the idea of compressing strings is great. One thought: 
   
   The call sequence will be: `byte[]` -> `base64 String` -> `gzip stream` -> `base64 String`
   
   ![image](https://user-images.githubusercontent.com/20113411/75521469-cc4a8a80-5a42-11ea-8d92-59b9f845d2d6.png)
   
   IMO, we can gzip the `byte[]` data directly, like:
   
   ![image](https://user-images.githubusercontent.com/20113411/75521938-bb4e4900-5a43-11ea-9399-8a8eeae72692.png)
   
   


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368710257
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
 ##########
 @@ -149,13 +150,26 @@ public static BloomFilter readBloomFilterFromParquetMetadata(Configuration confi
         readParquetFooter(configuration, false, parquetFilePath,
             HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
             HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
-            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE);
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE,
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED,
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
     String footerVal = footerVals.get(HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
     if (null == footerVal) {
       // We use old style key "com.uber.hoodie.bloomfilter"
       footerVal = footerVals.get(HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
     }
     BloomFilter toReturn = null;
+    boolean isCompressed = false;
+    if (footerVals.containsKey(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED)) {
+      isCompressed = Boolean.valueOf(footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED));
+      if (isCompressed) {
+        String compressionType = footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
+        Preconditions.checkArgument(compressionType.equals(GzipCompressionUtils.TYPE),
 
 Review comment:
   This can be replaced with ValidationUtils.checkArgument() once PR #1159 has been merged.


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384914566
 
 

 ##########
 File path: hudi-cli/src/main/scala/org/apache/hudi/cli/SparkHelpers.scala
 ##########
 @@ -43,7 +43,7 @@ object SparkHelpers {
     val schema: Schema = sourceRecords.get(0).getSchema
     val filter: BloomFilter = BloomFilterFactory.createBloomFilter(HoodieIndexConfig.DEFAULT_BLOOM_FILTER_NUM_ENTRIES.toInt, HoodieIndexConfig.DEFAULT_BLOOM_FILTER_FPP.toDouble,
       HoodieIndexConfig.DEFAULT_HOODIE_BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.toInt, HoodieIndexConfig.DEFAULT_BLOOM_INDEX_FILTER_TYPE);
-    val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter)
+    val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter, java.lang.Boolean.valueOf(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION))
 
 Review comment:
   Done.


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592649734
 
 
   Hi @vinothchandar
   
   > we can only place strings inside the parquet footers
   
   Right, I know that.
   
   The `byte[]` -> `base64 string` -> `byte[]` conversions are unnecessary steps:
   ![image](https://user-images.githubusercontent.com/20113411/75572628-1b6ddb00-5a96-11ea-8e8d-e66cd3883db8.png)
   
   ### What I want to say is
   ![image](https://user-images.githubusercontent.com/20113411/75572733-4e17d380-5a96-11ea-82c8-593e2083507d.png)


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368708574
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##########
 @@ -318,6 +318,10 @@ public double getBloomFilterFPP() {
     return Double.parseDouble(props.getProperty(HoodieIndexConfig.BLOOM_FILTER_FPP));
   }
 
+  public boolean isBloomFilterCompressionEnabled() {
+    return Boolean.valueOf(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION));
 
 Review comment:
   Use `Boolean.parseBoolean()` instead?
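
To make the suggestion concrete: `Boolean.valueOf(String)` returns a boxed `Boolean` that is then auto-unboxed for a `boolean` return type, while `Boolean.parseBoolean(String)` yields the primitive directly. Both treat a missing (null) property as `false` and match "true" case-insensitively. The sketch below uses a hypothetical property key, not the real Hudi config name.

```java
import java.util.Properties;

class BooleanParsingSketch {
  // Hypothetical key, used only for this illustration.
  static final String KEY = "example.bloom.compression.enabled";

  static boolean viaValueOf(Properties props) {
    // Creates a boxed Boolean, which is auto-unboxed on return.
    return Boolean.valueOf(props.getProperty(KEY));
  }

  static boolean viaParseBoolean(Properties props) {
    // Returns the primitive directly, no boxing involved.
    return Boolean.parseBoolean(props.getProperty(KEY));
  }
}
```

The two methods return the same result for every input; the difference is only the intermediate boxing.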


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r376785421
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestGzipCompressionUtils.java
 ##########
 @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.SimpleBloomFilter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.UUID;
+
 
 Review comment:
   Adding some annotation for the class would be better.


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r372198475
 
 

 ##########
 File path: hudi-common/src/test/java/org/apache/hudi/common/util/TestGzipCompressionUtils.java
 ##########
 @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.SimpleBloomFilter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.UUID;
+
+public class TestGzipCompressionUtils {
+
+  @Test
+  public void testCompressDeCompress() {
 
 Review comment:
   Minor: gzip is a generic compression scheme that works on any string. While we do verify that bloom index instantiation works after decompressing, do you think we could also add a test like the following?
   Generate random strings -> compress -> decompress, and verify that each result matches the originally generated value.
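
A round-trip check of this kind could be sketched as follows. This is an illustration built on `java.util.zip` only; `GzipRoundTripSketch` and its methods are hypothetical and do not reflect the actual `GzipCompressionUtils` API in the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTripSketch {

  static byte[] compress(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // Closing the gzip stream writes the trailer, so read the bytes afterwards.
    try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
      gzip.write(input);
    }
    return bos.toByteArray();
  }

  static byte[] decompress(byte[] compressed) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = gzip.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
    }
    return bos.toByteArray();
  }

  // Returns true when compress followed by decompress reproduces the input string.
  static boolean roundTrips(String original) {
    try {
      byte[] restored = decompress(compress(original.getBytes(StandardCharsets.UTF_8)));
      return original.equals(new String(restored, StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test would then call `roundTrips(UUID.randomUUID().toString())` in a loop and assert that every call returns `true`.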


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384914757
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
 ##########
 @@ -149,13 +150,26 @@ public static BloomFilter readBloomFilterFromParquetMetadata(Configuration confi
         readParquetFooter(configuration, false, parquetFilePath,
             HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
             HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
-            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE);
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE,
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED,
+            HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
     String footerVal = footerVals.get(HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
     if (null == footerVal) {
       // We use old style key "com.uber.hoodie.bloomfilter"
       footerVal = footerVals.get(HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
     }
     BloomFilter toReturn = null;
+    boolean isCompressed = false;
+    if (footerVals.containsKey(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED)) {
+      isCompressed = Boolean.valueOf(footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED));
+      if (isCompressed) {
+        String compressionType = footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
+        Preconditions.checkArgument(compressionType.equals(GzipCompressionUtils.TYPE),
 
 Review comment:
   Sounds good.


[GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-591730505
 
 
   @vinothchandar : I will work on this and update the PR in a day. 


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384914657
 
 

 ##########
 File path: hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/BloomFilterFactory.java
 ##########
 @@ -52,6 +52,7 @@ public static BloomFilter createBloomFilter(int numEntries, double errorRate, in
    * @return the {@link BloomFilter} thus generated from the passed in serialized string
    */
   public static BloomFilter fromString(String serString, String bloomFilterTypeCode) {
+
 
 Review comment:
   Done.


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r384914597
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -166,6 +168,11 @@ public Builder bloomIndexBucketizedChecking(boolean bucketizedChecking) {
       return this;
     }
 
+    public Builder bloomIndexEnableCompression(boolean enableCompression) {
+      props.setProperty(BLOOM_INDEX_ENABLE_COMPRESSION, String.valueOf(enableCompression));
 
 Review comment:
   Done.


[GitHub] [incubator-hudi] bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593560005
 
 
   @lamber-ken @leesf @nsivabalan : Yes, the additional string conversion is not needed, so I refactored a little to use the correct bloom-filter serialization method (depending on whether compression is enabled).
   
   @lamber-ken : I am observing the same behavior when comparing the compressed and uncompressed cases. Compression becomes less effective as bloom filter utilization (the number of keys stored in the bloom filter) increases. Snappy behaves the same way (although it compresses worse than gzip). I need to investigate this further.
   
   Results:
   
   ```
   test random keys
   original size: 4792548
   compress size (utilization=10%) : 2150956, CompressToOriginal=44
   compress size (utilization=20%) : 3078736, CompressToOriginal=64
   compress size (utilization=30%) : 3638548, CompressToOriginal=75
   compress size (utilization=40%) : 3977508, CompressToOriginal=82
   compress size (utilization=50%) : 4258972, CompressToOriginal=88
   compress size (utilization=60%) : 4490484, CompressToOriginal=93
   compress size (utilization=70%) : 4647776, CompressToOriginal=96
   compress size (utilization=80%) : 4750028, CompressToOriginal=99
   compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   test sequential keys
   original size: 4792548
   Using Byte[] - compress size (utilization=10%) : 2150852, CompressToOriginal=44
   Using Byte[] - compress size (utilization=20%) : 3078332, CompressToOriginal=64
   Using Byte[] - compress size (utilization=30%) : 3639000, CompressToOriginal=75
   Using Byte[] - compress size (utilization=40%) : 3977764, CompressToOriginal=82
   Using Byte[] - compress size (utilization=50%) : 4258544, CompressToOriginal=88
   Using Byte[] - compress size (utilization=60%) : 4490372, CompressToOriginal=93
   Using Byte[] - compress size (utilization=70%) : 4647832, CompressToOriginal=96
   Using Byte[] - compress size (utilization=80%) : 4749928, CompressToOriginal=99
   Using Byte[] - compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   Process finished with exit code 0
   
   ```
   
   Test code:
   ```
   @Test
   public void testit() {
     // Percentage of the filter's configured capacity (1,000,000 entries) that gets filled.
     int[] utilization = new int[] {10, 20, 30, 40, 50, 60, 70, 80, 90};
   
     System.out.println("test random keys");
     int originalSize = 0;
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = UUID.randomUUID().toString();
         filter.add(key);
       }
   
       if (i == 0) {
         // Record the uncompressed serialized size once, as the baseline for all ratios.
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   
     System.out.println("\ntest sequential keys");
   
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = "key-" + j;
         filter.add(key);
       }
       if (i == 0) {
         // Same baseline measurement as above, repeated for the sequential-key run.
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("Using Byte[] - compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   }
   ```
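
One possible explanation for the trend in these numbers, offered as a back-of-the-envelope sketch rather than a confirmed diagnosis: the bits of a bloom filter behave roughly like independent coin flips, and a filter sized with the standard optimal hash count k = (m/n) ln 2 has about half of its bits set at full capacity, which leaves a general-purpose compressor almost nothing to exploit. The code below assumes that sizing rule (it is not verified against `SimpleBloomFilter`) and computes the Shannon-entropy lower bound on the compressed size of the raw bit vector. It ignores the base64 overhead present in the measured sizes above.

```java
class BloomFilterEntropySketch {

  // Expected fraction of set bits when the filter holds `utilization` (0..1] of its
  // design capacity, assuming the optimal hash count k = (m/n) * ln 2.
  static double expectedFillFraction(double utilization) {
    return 1.0 - Math.exp(-utilization * Math.log(2.0));
  }

  // Shannon entropy, in bits per stored bit, of independent bits set with probability p.
  static double binaryEntropy(double p) {
    if (p <= 0.0 || p >= 1.0) {
      return 0.0;
    }
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
  }

  static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  public static void main(String[] args) {
    for (int u = 10; u <= 100; u += 10) {
      double p = expectedFillFraction(u / 100.0);
      System.out.printf("utilization=%d%% fill=%.3f bestCaseCompressedToOriginal=%.0f%%%n",
          u, p, binaryEntropy(p) * 100);
    }
  }
}
```

Under these assumptions the bound rises monotonically with utilization and reaches 100% at full capacity, which is the same shape as the measurements for both random and sequential keys.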
   
   


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-592397157
 
 
   Hi @bvaradar, 
   The call sequence will be: `byte[]` -> `base64 String` -> `gzip stream` -> `base64 String`
   
   ![image](https://user-images.githubusercontent.com/20113411/75521469-cc4a8a80-5a42-11ea-8d92-59b9f845d2d6.png)
   
   IMO, we can gzip-compress the `byte[]` data directly, like:
   
   ![image](https://user-images.githubusercontent.com/20113411/75521938-bb4e4900-5a43-11ea-9399-8a8eeae72692.png)
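
Since the screenshots are not reproduced in this plain-text view, here is a minimal sketch of the direct path being proposed: gzip the raw `byte[]`, then base64-encode exactly once so the result can be stored as a string footer value. The class and method names are hypothetical and use only the JDK; they are not the code from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DirectGzipSketch {

  // byte[] -> gzip -> base64 String (a single string conversion, at the very end).
  static String toFooterValue(byte[] filterBytes) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
      gzip.write(filterBytes);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Base64.getEncoder().encodeToString(bos.toByteArray());
  }

  // base64 String -> gzip bytes -> original byte[].
  static byte[] fromFooterValue(String footerValue) {
    byte[] compressed = Base64.getDecoder().decode(footerValue);
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = gzip.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }
}
```

Compressing before encoding also means gzip sees the raw bit vector rather than its base64 text, which is already about a third larger than the raw bytes.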
   
   


[GitHub] [incubator-hudi] nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-577423114
 
 
   Sure, will take a look.


[GitHub] [incubator-hudi] vinothchandar closed pull request #1253: [WIP] [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

Posted by GitBox <gi...@apache.org>.
vinothchandar closed pull request #1253:
URL: https://github.com/apache/incubator-hudi/pull/1253


   

