You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@carbondata.apache.org by ja...@apache.org on 2018/08/07 13:09:43 UTC
[19/50] [abbrv] carbondata git commit: [CARBONDATA-2790][BloomDataMap]Optimize default parameter for bloomfilter datamap

[CARBONDATA-2790][BloomDataMap]Optimize default parameter for bloomfilter datamap

To provide better query performance for bloomfilter datamap by default,
we optimize bloom_size from 32000 to 640000 and optimize bloom_fpp from
0.01 to 0.00001.

This closes #2567


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/6351c3a0
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/6351c3a0
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/6351c3a0

Branch: refs/heads/external-format
Commit: 6351c3a077c0fa47390c4b30b05de7d830b387d1
Parents: c29aef8
Author: xuchuanyin <xu...@hust.edu.cn>
Authored: Fri Jul 27 11:54:21 2018 +0800
Committer: Jacky Li <ja...@qq.com>
Committed: Wed Aug 1 17:43:49 2018 +0800

----------------------------------------------------------------------
 .../datamap/bloom/BloomCoarseGrainDataMapFactory.java          | 6 +++---
 docs/datamap/bloomfilter-datamap-guide.md                      | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/6351c3a0/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
----------------------------------------------------------------------
diff --git a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
index 652e1fc..80a86cc 100644
--- a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
+++ b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
@@ -69,15 +69,15 @@ public class BloomCoarseGrainDataMapFactory extends DataMapFactory<CoarseGrainDa
    * default size for bloom filter, cardinality of the column.
    */
   private static final int DEFAULT_BLOOM_FILTER_SIZE =
-      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT;
+      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT * 20;
   /**
    * property for fpp(false-positive-probability) of bloom filter
    */
   private static final String BLOOM_FPP = "bloom_fpp";
   /**
-   * default value for fpp of bloom filter is 1%
+   * default value for fpp of bloom filter is 0.001%
    */
-  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.01d;
+  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.00001d;
 
   /**
    * property for compressing bloom while saving to disk.

http://git-wip-us.apache.org/repos/asf/carbondata/blob/6351c3a0/docs/datamap/bloomfilter-datamap-guide.md
----------------------------------------------------------------------
diff --git a/docs/datamap/bloomfilter-datamap-guide.md b/docs/datamap/bloomfilter-datamap-guide.md
index 325a508..2dba3dc 100644
--- a/docs/datamap/bloomfilter-datamap-guide.md
+++ b/docs/datamap/bloomfilter-datamap-guide.md
@@ -83,8 +83,8 @@ User can create BloomFilter datamap using the Create DataMap DDL:
 | Property | Is Required | Default Value | Description |
 |-------------|----------|--------|---------|
 | INDEX_COLUMNS | YES |  | Carbondata will generate BloomFilter index on these columns. Queries on there columns are usually like 'COL = VAL'. |
-| BLOOM_SIZE | NO | 32000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the value is the approximate records in a blocklet. In another word, the value 32000 * #noOfPagesInBlocklet. The value should be an integer. |
-| BLOOM_FPP | NO | 0.01 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). |
+| BLOOM_SIZE | NO | 640000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the default value is the approximate distinct index values in a blocklet assuming that each blocklet contains 20 pages and each page contains 32000 records. The value should be an integer. |
+| BLOOM_FPP | NO | 0.00001 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). In one test scenario, a 96GB TPCH customer table with bloom_size=320000 and bloom_fpp=0.00001 will result in 18 false positive samples. |
 | BLOOM_COMPRESS | NO | true | Whether to compress the BloomFilter index files. |