You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/14 22:47:40 UTC

[GitHub] [druid] suneet-s commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

suneet-s commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r470889198



##########
File path: integration-tests/src/test/resources/indexer/wikipedia_parallel_ingest_segment_index_task.json
##########
@@ -61,7 +61,7 @@
       "forceGuaranteedRollup": "%%FORCE_GUARANTEED_ROLLUP%%",
       "splitHintSpec": {
         "type": "maxSize",
-        "maxSplitSize": 1

Review comment:
       Should we keep one of these tests to use `maxSplitSize` so that we know that the old specs continue to work

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +44,53 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.
+   */

Review comment:
       This is a great comment! Thank you!

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). [Human-readable format](../configuration/human-readable-byte.md) is supported.|500MiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|1000|no|

Review comment:
       Is there a pattern to document the relationship between these 2 configs. 
   
   Reading these docs I'm not sure if I need to consider one when setting the other or which one takes precedence.
   

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +44,53 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.
+   */
+  @VisibleForTesting
+  static final int DEFAULT_MAX_NUM_FILES = 1000;
+
+  private final HumanReadableBytes maxSplitSize;
+  private final int maxNumFiles;
 
   @JsonCreator
-  public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize)
+  public MaxSizeSplitHintSpec(
+      @JsonProperty("maxSplitSize") @Nullable HumanReadableBytes maxSplitSize,
+      @JsonProperty("maxNumFiles") @Nullable Integer maxNumFiles
+  )
   {
     this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize;
+    this.maxNumFiles = maxNumFiles == null ? DEFAULT_MAX_NUM_FILES : maxNumFiles;

Review comment:
       Should we add a preconditionCheck on maxNumFiles? If the user has entered maxNumFiles, it should be a positive number >= 1
   
   I think what you've implemented works even with negative numbers, but maybe it's better to tell the user they're doing something strange.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org