Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/09/22 21:01:27 UTC

[GitHub] [druid] jon-wei commented on a change in pull request #10288: Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided

jon-wei commented on a change in pull request #10288:
URL: https://github.com/apache/druid/pull/10288#discussion_r493025995



##########
File path: core/src/main/java/org/apache/druid/timeline/partition/HashBasedNumberedShardSpec.java
##########
@@ -98,141 +107,44 @@ public int getNumBuckets()
     return partitionDimensions;
   }
 
-  @Override
-  public List<String> getDomainDimensions()
+  @JsonProperty
+  public @Nullable HashPartitionFunction getPartitionFunction()
   {
-    return partitionDimensions;
+    return partitionFunction;
   }
 
   @Override
-  public boolean isInChunk(long timestamp, InputRow inputRow)
-  {
-    return getBucketIndex(hash(timestamp, inputRow), numBuckets) == bucketId % numBuckets;
-  }
-
-  /**
-   * Check if the current segment possibly holds records if the values of dimensions in {@link #partitionDimensions}
-   * are of {@code partitionDimensionsValues}
-   *
-   * @param partitionDimensionsValues An instance of values of dimensions in {@link #partitionDimensions}
-   *
-   * @return Whether the current segment possibly holds records for the given values of partition dimensions
-   */
-  private boolean isInChunk(Map<String, String> partitionDimensionsValues)
-  {
-    assert !partitionDimensions.isEmpty();
-    List<Object> groupKey = Lists.transform(
-        partitionDimensions,
-        o -> Collections.singletonList(partitionDimensionsValues.get(o))
-    );
-    try {
-      return getBucketIndex(hash(jsonMapper, groupKey), numBuckets) == bucketId % numBuckets;
-    }
-    catch (JsonProcessingException e) {
-      throw new RuntimeException(e);
-    }
-  }
-
-  /**
-   * This method calculates the hash based on whether {@param partitionDimensions} is null or not.
-   * If yes, then both {@param timestamp} and dimension columns in {@param inputRow} are used {@link Rows#toGroupKey}
-   * Or else, columns in {@param partitionDimensions} are used
-   *
-   * @param timestamp should be bucketed with query granularity
-   * @param inputRow row from input data
-   *
-   * @return hash value
-   */
-  protected int hash(long timestamp, InputRow inputRow)
-  {
-    return hash(jsonMapper, partitionDimensions, timestamp, inputRow);
-  }
-
-  public static int hash(ObjectMapper jsonMapper, List<String> partitionDimensions, long timestamp, InputRow inputRow)
-  {
-    final List<Object> groupKey = getGroupKey(partitionDimensions, timestamp, inputRow);
-    try {
-      return hash(jsonMapper, groupKey);
-    }
-    catch (JsonProcessingException e) {
-      throw new RuntimeException(e);
-    }
-  }
-
-  @VisibleForTesting
-  static List<Object> getGroupKey(final List<String> partitionDimensions, final long timestamp, final InputRow inputRow)
-  {
-    if (partitionDimensions.isEmpty()) {
-      return Rows.toGroupKey(timestamp, inputRow);
-    } else {
-      return Lists.transform(partitionDimensions, inputRow::getDimension);
-    }
-  }
-
-  @VisibleForTesting
-  public static int hash(ObjectMapper jsonMapper, List<Object> objects) throws JsonProcessingException
+  public List<String> getDomainDimensions()
   {
-    return HASH_FUNCTION.hashBytes(jsonMapper.writeValueAsBytes(objects)).asInt();
+    return partitionDimensions;
   }
 
   @Override
   public ShardSpecLookup getLookup(final List<? extends ShardSpec> shardSpecs)
   {
-    return createHashLookup(jsonMapper, partitionDimensions, shardSpecs, numBuckets);
-  }
-
-  static ShardSpecLookup createHashLookup(
-      ObjectMapper jsonMapper,
-      List<String> partitionDimensions,
-      List<? extends ShardSpec> shardSpecs,
-      int numBuckets
-  )
-  {
-    return (long timestamp, InputRow row) -> {
-      int index = getBucketIndex(hash(jsonMapper, partitionDimensions, timestamp, row), numBuckets);
-      return shardSpecs.get(index);
-    };
+    // partitionFunction can be null when you read a shardSpec of a segment created in an old version of Druid.
+    // It can never be null for segments to create during ingestion.

Review comment:
       suggest rewording the second line to something like "The current version of Druid will always specify a partitionFunction on newly created segments"
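
For readers skimming this hunk, the removed methods all implement one idea: serialize the row's group key (either the full timestamp-plus-dimensions key from Rows#toGroupKey, or just the partitionDimensions values) to JSON, hash the bytes, and map the hash onto one of numBuckets buckets. A minimal, self-contained sketch of that flow, assuming HASH_FUNCTION is Guava's 32-bit murmur3 and getBucketIndex() is abs(hash % numBuckets) (neither definition is shown in this hunk):

```java
// Illustrative sketch of the hashing flow visible in the removed code above.
// Assumptions (not shown in this hunk): HASH_FUNCTION is Guava's 32-bit murmur3,
// and getBucketIndex() folds the signed hash into [0, numBuckets) with abs-of-modulo.
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.hash.Hashing;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class HashPartitionSketch
{
  private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

  // Mirrors hash(ObjectMapper, List<Object>): hash the JSON serialization of the group key.
  static int hash(List<Object> groupKey) throws JsonProcessingException
  {
    return Hashing.murmur3_32().hashBytes(JSON_MAPPER.writeValueAsBytes(groupKey)).asInt();
  }

  // Assumed shape of getBucketIndex(); its body does not appear in the hunk above.
  static int getBucketIndex(int hash, int numBuckets)
  {
    return Math.abs(hash % numBuckets);
  }

  public static void main(String[] args) throws JsonProcessingException
  {
    // Group key built from partitionDimensions values, one singleton list per dimension,
    // matching the shape used by the removed isInChunk(Map) helper.
    List<Object> groupKey = Arrays.asList(
        Collections.singletonList("US"),
        Collections.singletonList("android")
    );
    System.out.println("bucket index = " + getBucketIndex(hash(groupKey), 4));
  }
}
```

The getPartitionFunction() property added at the top of the hunk records which hash function produced that bucket assignment, and per the quoted comment it may be null for segments written by older versions of Druid.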

##########
File path: docs/ingestion/native-batch.md
##########
@@ -260,7 +260,7 @@ The three `partitionsSpec` types have different characteristics.
 | PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Segment pruning at query time |
 |----------------|-----------------|---------------------|-----------------------|-------------------------------|
 | `dynamic` | Fastest  | Partitioning based on number of rows in segment. | Best-effort rollup | N/A |
-| `hashed`  | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing. |
+| `hashed`  | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` and `partitionFunction` must be set to enable segment pruning.|

Review comment:
       > `partitionFunction` must be set to enable segment pruning.
   
    Is it necessary to set `partitionFunction`? It's shown as optional in the tables below.
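
The pruning behavior this table row describes mirrors the isInChunk(Map) helper removed in the first hunk: rebuild the group key from the equality-filter values on the partitionDimensions, hash it exactly as the ingestion task did, and keep only the shards whose bucket matches. A minimal sketch under the same murmur3 and abs-modulo assumptions as above, with illustrative names (Shard, mayContain) that are not Druid's API:

```java
// Illustrative pruning check: recompute the ingestion-time bucket from an equality filter
// on all partitionDimensions and keep only the shards that can hold matching rows.
// Shard and mayContain are hypothetical names for this sketch, not Druid classes.
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Lists;
import com.google.common.hash.Hashing;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SegmentPruningSketch
{
  private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

  /** Hypothetical stand-in for one hashed shard: its bucketId and the total bucket count. */
  static class Shard
  {
    final int bucketId;
    final int numBuckets;

    Shard(int bucketId, int numBuckets)
    {
      this.bucketId = bucketId;
      this.numBuckets = numBuckets;
    }
  }

  // Same group-key shape as the removed isInChunk(Map) helper: one singleton list per dimension.
  static boolean mayContain(Shard shard, List<String> partitionDimensions, Map<String, String> filterValues)
      throws JsonProcessingException
  {
    List<Object> groupKey = Lists.transform(
        partitionDimensions,
        dim -> Collections.singletonList(filterValues.get(dim))
    );
    int hash = Hashing.murmur3_32().hashBytes(JSON_MAPPER.writeValueAsBytes(groupKey)).asInt();
    return Math.abs(hash % shard.numBuckets) == shard.bucketId % shard.numBuckets;
  }
}
```

With an equality filter on every partition dimension, only the shards for which mayContain returns true need to be scanned; the rest can be skipped before the query fans out. Per the PR title, this pruning is only allowed when the hash partition function is recorded on the segment.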

##########
File path: docs/ingestion/index.md
##########
@@ -90,7 +90,7 @@ This table compares the three available options:
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
 | **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |

Review comment:
       is the `-1` a typo here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org