Posted to issues@carbondata.apache.org by ravipesala <gi...@git.apache.org> on 2018/05/20 16:47:04 UTC

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2324

    [CARBONDATA-2496] Changed to hadoop bloom implementation and added compress option to compress bloom on disk.

    This PR removes the Guava bloom filter and adds the Hadoop bloom filter. It also adds a compress option that compresses the bloom filter on disk and in memory as well.
    The user can use the `bloom_compress` property to enable/disable compression. By default, it is enabled.
    Bloom performance comparison:
    Loaded 100 million rows with a bloom datamap on a column with a cardinality of 5 million, using 'BLOOM_SIZE'='5000000', 'bloom_fpp'='0.001'.
    
    Guava
    -----------------------------
    DataMap Size : 233.6 MB
    First Read   : 4.981 sec
    Second Read  : 0.541 sec
    
    Hadoop
    -----------------------------
    DataMap Size : 224.7 MB
    First Read   : 4.829 sec
    Second Read  : 0.555 sec
    
    Hadoop with compression
    -----------------------------
    DataMap Size : 98.8 MB
    First Read   : 4.984 sec
    Second Read  : 0.621 sec
    
    As per the above readings, the compressed Hadoop implementation gives optimal performance in terms of both space and read time. That's why compression is enabled by default.
    
    Here RoaringBitmap is used internally to compress the bloom filter, so it is space efficient not only on disk but also in memory.
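    
    Below is a minimal sketch (hypothetical class and method names, not the PR's exact code) of how the set bits of a bloom filter's bit vector can be stored in a RoaringBitmap and probed during membership tests:
    
    ```java
    import java.util.BitSet;
    import org.roaringbitmap.RoaringBitmap;
    
    // Hypothetical illustration: only the positions of set bits are stored,
    // which RoaringBitmap encodes compactly.
    public class BloomCompressSketch {
    
      // Copy every set-bit position of the bloom's bit vector into a RoaringBitmap.
      public static RoaringBitmap compress(BitSet bits) {
        RoaringBitmap bitmap = new RoaringBitmap();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
          bitmap.add(i);
        }
        bitmap.runOptimize(); // switch to run-length containers where that is smaller
        return bitmap;
      }
    
      // Membership test: the key may be present only if every hashed position is set.
      public static boolean mightContain(RoaringBitmap bitmap, int[] hashPositions) {
        for (int pos : hashPositions) {
          if (!bitmap.contains(pos)) {
            return false; // definitely absent
          }
        }
        return true; // possibly present, subject to the false positive rate
      }
    }
    ```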
    
    Be sure to complete all of the following checklist items to help us
    incorporate your contribution quickly and easily:
    
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
    
     - [ ] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata bloom-change1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2324.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2324
    
----
commit 5bf9726bd5f5703bb8ef51abf6a6c75bc5423586
Author: ravipesala <ra...@...>
Date:   2018-05-20T16:22:57Z

    Changed to hadoop bloom implementation and added compress option to compress bloom on disk.

----


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    @xuchuanyin it is all about the cardinality of the column the user is creating a bloom on. I don't think we should create a very high cardinality bloom by default; if the user wants, he can still pass it through the property. We cannot set the defaults very high based on specific scenarios.



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5006/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189611278
  
    --- Diff: datamap/bloom/src/main/java/org/apache/hadoop/util/bloom/CarbonBloomFilter.java ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.hadoop.util.bloom;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +import java.util.BitSet;
    +
    +import org.roaringbitmap.RoaringBitmap;
    +
    +/**
    + * It is the extendable class to hadoop bloomfilter, it is extendable to implement compressed bloom
    + * and fast serialize and deserialize of bloom.
    + */
    +public class CarbonBloomFilter extends BloomFilter {
    +
    +  private RoaringBitmap bitmap;
    +
    +  private boolean compress;
    +
    +  public CarbonBloomFilter() {
    +  }
    +
    +  public CarbonBloomFilter(int vectorSize, int nbHash, int hashType, boolean compress) {
    +    super(vectorSize, nbHash, hashType);
    +    this.compress = compress;
    +  }
    +
    +  @Override
    +  public boolean membershipTest(Key key) {
    +    if (key == null) {
    +      throw new NullPointerException("key cannot be null");
    +    }
    +
    +    int[] h = hash.hash(key);
    +    hash.clear();
    +    if (compress) {
    +      // If it is compressed chek in roaring bitmap
    --- End diff --
    
    chek?


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189647130
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapCache.java ---
    @@ -133,15 +132,14 @@ private int validateAndGetCacheSize() {
        */
       private List<BloomDMModel> loadBloomDataMapModel(CacheKey cacheKey) {
         DataInputStream dataInStream = null;
    -    ObjectInputStream objectInStream = null;
         List<BloomDMModel> bloomDMModels = new ArrayList<BloomDMModel>();
         try {
           String indexFile = getIndexFileFromCacheKey(cacheKey);
           dataInStream = FileFactory.getDataInputStream(indexFile, FileFactory.getFileType(indexFile));
    -      objectInStream = new ObjectInputStream(dataInStream);
           try {
    -        BloomDMModel model = null;
    -        while ((model = (BloomDMModel) objectInStream.readObject()) != null) {
    +        while (dataInStream.available() > 0) {
    --- End diff --
    
    I don't know what problem you are facing, but the old code cannot work, as we are no longer doing object serialization here.


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189647160
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java ---
    @@ -163,21 +176,37 @@ private double validateAndGetBloomFilterFpp(DataMapSchema dmSchema)
         return bloomFilterFpp;
       }
     
    +  /**
    +   * validate bloom DataMap BLOOM_FPP
    --- End diff --
    
    ok


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4868/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189647499
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDMModel.java ---
    @@ -40,15 +46,29 @@ public int getBlockletNo() {
         return blockletNo;
       }
     
    -  public BloomFilter<byte[]> getBloomFilter() {
    +  public BloomFilter getBloomFilter() {
    --- End diff --
    
    ok


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5989/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2324


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189494115
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java ---
    @@ -66,22 +66,32 @@
        */
       private static final String BLOOM_SIZE = "bloom_size";
       /**
    -   * default size for bloom filter: suppose one blocklet contains 20 pages
    -   * and all the indexed value is distinct.
    +   * default size for bloom filter, cardinality of the column.
        */
    -  private static final int DEFAULT_BLOOM_FILTER_SIZE = 32000 * 20;
    +  private static final int DEFAULT_BLOOM_FILTER_SIZE = Short.MAX_VALUE;
    --- End diff --
    
    ok


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189612466
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDMModel.java ---
    @@ -40,15 +46,29 @@ public int getBlockletNo() {
         return blockletNo;
       }
     
    -  public BloomFilter<byte[]> getBloomFilter() {
    +  public BloomFilter getBloomFilter() {
    --- End diff --
    
    return CarbonBloomFilter
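    
    That is, a hedged sketch of what narrowing the return type could look like (class and constructor here are hypothetical, based on the diff):
    
    ```java
    import org.apache.hadoop.util.bloom.CarbonBloomFilter;
    
    // Declaring the concrete type avoids casts at call sites.
    public class BloomDMModelSketch {
      private final CarbonBloomFilter bloomFilter;
    
      public BloomDMModelSketch(CarbonBloomFilter bloomFilter) {
        this.bloomFilter = bloomFilter;
      }
    
      public CarbonBloomFilter getBloomFilter() { // instead of the wider BloomFilter
        return bloomFilter;
      }
    }
    ```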


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4824/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4825/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    @ravipesala @jackylk
    I'm not OK with the default size and fpp for the bloomfilter. In general use cases, it will result in many false positives.
    In my test, a 96GB customer table with size=320K & fpp=0.00001 resulted in 18 false positives.


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189647876
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java ---
    @@ -86,12 +86,31 @@ public void onBlockletStart(int blockletId) {
       protected void resetBloomFilters() {
         indexBloomFilters.clear();
         List<CarbonColumn> indexColumns = getIndexColumns();
    +    int[] stats = calculateBloomStats();
         for (int i = 0; i < indexColumns.size(); i++) {
    -      indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(),
    -          bloomFilterSize, bloomFilterFpp));
    +      indexBloomFilters
    +          .add(new CarbonBloomFilter(stats[0], stats[1], Hash.MURMUR_HASH, compressBloom));
         }
       }
     
    +  /**
    +   * It calculates the bits size and number of hash functions to calculate bloom.
    +   */
    +  private int[] calculateBloomStats() {
    +    /*
    +     * n: how many items you expect to have in your filter
    +     * p: your acceptable false positive rate
    +     * Number of bits (m) = -n*ln(p) / (ln(2)^2)
    +     * Number of hashes(k) = m/n * ln(2)
    --- End diff --
    
    Can't, as `k` is dependent on `m`.
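    
    For reference, a minimal sketch of the standard sizing formulas quoted in the diff (class and method names are hypothetical); since `k` is derived from `m`, `m` has to be computed first:
    
    ```java
    // A sketch of the standard bloom filter sizing formulas.
    public class BloomStatsSketch {
      // n: expected number of items, p: acceptable false positive rate
      static int[] calculateBloomStats(long n, double p) {
        // Number of bits:   m = -n * ln(p) / (ln 2)^2
        int m = (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        // Number of hashes: k = (m / n) * ln(2)  -- depends on m
        int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        return new int[] {m, k};
      }
    
      public static void main(String[] args) {
        // e.g. n = 5,000,000 and p = 0.001 gives m ≈ 71.9 million bits, k ≈ 10
        int[] stats = calculateBloomStats(5_000_000L, 0.001);
        System.out.println("bits=" + stats[0] + ", hashes=" + stats[1]);
      }
    }
    ```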


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4877/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6000/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189613145
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java ---
    @@ -86,12 +86,31 @@ public void onBlockletStart(int blockletId) {
       protected void resetBloomFilters() {
         indexBloomFilters.clear();
         List<CarbonColumn> indexColumns = getIndexColumns();
    +    int[] stats = calculateBloomStats();
         for (int i = 0; i < indexColumns.size(); i++) {
    -      indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(),
    -          bloomFilterSize, bloomFilterFpp));
    +      indexBloomFilters
    +          .add(new CarbonBloomFilter(stats[0], stats[1], Hash.MURMUR_HASH, compressBloom));
         }
       }
     
    +  /**
    +   * It calculates the bits size and number of hash functions to calculate bloom.
    +   */
    +  private int[] calculateBloomStats() {
    +    /*
    +     * n: how many items you expect to have in your filter
    +     * p: your acceptable false positive rate
    +     * Number of bits (m) = -n*ln(p) / (ln(2)^2)
    +     * Number of hashes(k) = m/n * ln(2)
    --- End diff --
    
    Why not calculate `k` before `m`? It could save some computation.


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189477234
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java ---
    @@ -66,22 +66,32 @@
        */
       private static final String BLOOM_SIZE = "bloom_size";
       /**
    -   * default size for bloom filter: suppose one blocklet contains 20 pages
    -   * and all the indexed value is distinct.
    +   * default size for bloom filter, cardinality of the column.
        */
    -  private static final int DEFAULT_BLOOM_FILTER_SIZE = 32000 * 20;
    +  private static final int DEFAULT_BLOOM_FILTER_SIZE = Short.MAX_VALUE;
    --- End diff --
    
    Can you make the page size a constant and use it, so that we can easily change it later when we make the page size configurable?
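    
    For example (hypothetical constant names; the 32000 rows per page and 20 pages per blocklet come from the old default shown in the diff), something like this inside the factory class:
    
    ```java
    // Hypothetical: extract the page size so it can be changed in one place later.
    private static final int DEFAULT_PAGE_SIZE = 32000;
    private static final int DEFAULT_BLOOM_FILTER_SIZE = DEFAULT_PAGE_SIZE * 20;
    ```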


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6027/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189609624
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapCache.java ---
    @@ -133,15 +132,14 @@ private int validateAndGetCacheSize() {
        */
       private List<BloomDMModel> loadBloomDataMapModel(CacheKey cacheKey) {
         DataInputStream dataInStream = null;
    -    ObjectInputStream objectInStream = null;
         List<BloomDMModel> bloomDMModels = new ArrayList<BloomDMModel>();
         try {
           String indexFile = getIndexFileFromCacheKey(cacheKey);
           dataInStream = FileFactory.getDataInputStream(indexFile, FileFactory.getFileType(indexFile));
    -      objectInStream = new ObjectInputStream(dataInStream);
           try {
    -        BloomDMModel model = null;
    -        while ((model = (BloomDMModel) objectInStream.readObject()) != null) {
    +        while (dataInStream.available() > 0) {
    --- End diff --
    
    I checked the `available` method when I contributed these lines of code and found it's not suitable here.
    Better to keep them as they are.
    Maybe you can check it again.


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5042/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5014/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5046/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    LGTM


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5005/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4841/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189797915
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapCache.java ---
    @@ -133,15 +132,14 @@ private int validateAndGetCacheSize() {
        */
       private List<BloomDMModel> loadBloomDataMapModel(CacheKey cacheKey) {
         DataInputStream dataInStream = null;
    -    ObjectInputStream objectInStream = null;
         List<BloomDMModel> bloomDMModels = new ArrayList<BloomDMModel>();
         try {
           String indexFile = getIndexFileFromCacheKey(cacheKey);
           dataInStream = FileFactory.getDataInputStream(indexFile, FileFactory.getFileType(indexFile));
    -      objectInStream = new ObjectInputStream(dataInStream);
           try {
    -        BloomDMModel model = null;
    -        while ((model = (BloomDMModel) objectInStream.readObject()) != null) {
    +        while (dataInStream.available() > 0) {
    --- End diff --
    
    It seems that you use `dataInStream.available() == 0` to indicate that all the content of the stream has been consumed. However, it only means that you have consumed all the content that is available without blocking; there may be leftover content in the stream that will be returned by the next blocking call.


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5010/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189477491
  
    --- Diff: datamap/bloom/src/main/java/org/apache/hadoop/util/bloom/CarbonBloomFilter.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.hadoop.util.bloom;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +import java.util.BitSet;
    +
    +import org.roaringbitmap.RoaringBitmap;
    +
    +public class CarbonBloomFilter extends BloomFilter {
    --- End diff --
    
    Please add a comment.


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189798225
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapCache.java ---
    @@ -133,15 +132,14 @@ private int validateAndGetCacheSize() {
        */
       private List<BloomDMModel> loadBloomDataMapModel(CacheKey cacheKey) {
         DataInputStream dataInStream = null;
    -    ObjectInputStream objectInStream = null;
         List<BloomDMModel> bloomDMModels = new ArrayList<BloomDMModel>();
         try {
           String indexFile = getIndexFileFromCacheKey(cacheKey);
           dataInStream = FileFactory.getDataInputStream(indexFile, FileFactory.getFileType(indexFile));
    -      objectInStream = new ObjectInputStream(dataInStream);
           try {
    -        BloomDMModel model = null;
    -        while ((model = (BloomDMModel) objectInStream.readObject()) != null) {
    +        while (dataInStream.available() > 0) {
    --- End diff --
    
    That's my understanding of the javadoc.
    If the datamap file is small, it may not cause a problem, but I insist that it will cause problems in some scenarios.


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189605492
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java ---
    @@ -163,21 +176,37 @@ private double validateAndGetBloomFilterFpp(DataMapSchema dmSchema)
         return bloomFilterFpp;
       }
     
    +  /**
    +   * validate bloom DataMap BLOOM_FPP
    --- End diff --
    
    fix the description 


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6036/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189797259
  
    --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapCache.java ---
    @@ -133,15 +132,14 @@ private int validateAndGetCacheSize() {
        */
       private List<BloomDMModel> loadBloomDataMapModel(CacheKey cacheKey) {
         DataInputStream dataInStream = null;
    -    ObjectInputStream objectInStream = null;
         List<BloomDMModel> bloomDMModels = new ArrayList<BloomDMModel>();
         try {
           String indexFile = getIndexFileFromCacheKey(cacheKey);
           dataInStream = FileFactory.getDataInputStream(indexFile, FileFactory.getFileType(indexFile));
    -      objectInStream = new ObjectInputStream(dataInStream);
           try {
    -        BloomDMModel model = null;
    -        while ((model = (BloomDMModel) objectInStream.readObject()) != null) {
    +        while (dataInStream.available() > 0) {
    --- End diff --
    
    The javadoc said
    ```
    Returns an estimate of the number of bytes that can be read (or skipped over)
    from this input stream without blocking by the next invocation of a method
    for this input stream. 
    ```
    So `dataInStream.available() == 0` does not mean that we have already read all the content from the stream.
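    
    For illustration, a hedged sketch (with a hypothetical length-prefixed record layout) of a loop that avoids this pitfall by reading until `EOFException`: `readInt`/`readFully` block until the bytes arrive or the stream truly ends, whereas `available()` can return 0 mid-stream, e.g. on a remote (HDFS) input stream while bytes are still in flight:
    
    ```java
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    
    // Hypothetical record layout: each entry is an int length followed by a payload.
    public class EofReadSketch {
      static void readAllEntries(DataInputStream in) throws IOException {
        try {
          while (true) {
            int len = in.readInt();      // blocks until 4 bytes or true end of stream
            byte[] payload = new byte[len];
            in.readFully(payload);       // blocks until 'len' bytes have been read
            // ... deserialize 'payload' into a model object ...
          }
        } catch (EOFException e) {
          // genuine end of stream: all entries have been consumed
        }
      }
    }
    ```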


---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189647321
  
    --- Diff: datamap/bloom/src/main/java/org/apache/hadoop/util/bloom/CarbonBloomFilter.java ---
    @@ -0,0 +1,108 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.hadoop.util.bloom;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +import java.util.BitSet;
    +
    +import org.roaringbitmap.RoaringBitmap;
    +
    +/**
    + * It is the extendable class to hadoop bloomfilter, it is extendable to implement compressed bloom
    + * and fast serialize and deserialize of bloom.
    + */
    +public class CarbonBloomFilter extends BloomFilter {
    +
    +  private RoaringBitmap bitmap;
    +
    +  private boolean compress;
    +
    +  public CarbonBloomFilter() {
    +  }
    +
    +  public CarbonBloomFilter(int vectorSize, int nbHash, int hashType, boolean compress) {
    +    super(vectorSize, nbHash, hashType);
    +    this.compress = compress;
    +  }
    +
    +  @Override
    +  public boolean membershipTest(Key key) {
    +    if (key == null) {
    +      throw new NullPointerException("key cannot be null");
    +    }
    +
    +    int[] h = hash.hash(key);
    +    hash.clear();
    +    if (compress) {
    +      // If it is compressed chek in roaring bitmap
    --- End diff --
    
    ok


---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5982/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4830/



---

[GitHub] carbondata issue #2324: [CARBONDATA-2496] Changed to hadoop bloom implementa...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2324
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5983/



---

[GitHub] carbondata pull request #2324: [CARBONDATA-2496] Changed to hadoop bloom imp...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2324#discussion_r189494121
  
    --- Diff: datamap/bloom/src/main/java/org/apache/hadoop/util/bloom/CarbonBloomFilter.java ---
    @@ -0,0 +1,103 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.hadoop.util.bloom;
    +
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +import java.util.BitSet;
    +
    +import org.roaringbitmap.RoaringBitmap;
    +
    +public class CarbonBloomFilter extends BloomFilter {
    --- End diff --
    
    ok


---