Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/19 21:00:46 UTC

[GitHub] [druid] clintropolis commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

clintropolis commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473313243



##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|

Review comment:
       typo: 'Noe' -> 'Note'

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.

Review comment:
       Is this a typo? 'consertively' -> 'conservatively'

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxSplitSize` even if the total number of files is smaller than `maxNumFiles`.|1000|no|

Review comment:
       Does this limit apply to the entire parallel task, only to the subtasks, or to both? The docs here don't make it clear. From my reading of the code, it looks like it applies to subtasks?
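
       To make the question concrete, here is a minimal sketch of how the two limits would interact if both apply per subtask, as the doc text above suggests. It is not the actual `MaxSizeSplitHintSpec` code: the class `SplitHintSketch`, the method `group`, and the use of plain byte counts in place of input files are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Druid's implementation.
class SplitHintSketch {

    /**
     * Groups file sizes (in bytes) into splits, in input order. A split is closed
     * as soon as adding the next file would exceed maxSplitSize or maxNumFiles.
     * A file larger than maxSplitSize ends up alone in its own split, because
     * files are never divided across splits.
     */
    static List<List<Long>> group(List<Long> fileSizes, long maxSplitSize, int maxNumFiles) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;

        for (long size : fileSizes) {
            boolean wouldExceedSize = currentBytes + size > maxSplitSize;
            boolean wouldExceedCount = current.size() + 1 > maxNumFiles;

            // Never close an empty split; that is what lets an oversized file through.
            if (!current.isEmpty() && (wouldExceedSize || wouldExceedCount)) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed toy values: maxSplitSize = 1000 bytes, maxNumFiles = 2.
        List<Long> sizes = List.of(400L, 400L, 400L, 2000L, 10L, 10L, 10L);
        for (List<Long> split : group(sizes, 1000L, 2)) {
            System.out.println(split);
        }
    }
}
```

       Under this reading, each inner list is what one subtask would process, so a parallel task as a whole could still cover far more than `maxNumFiles` files. If that is the intended behavior, saying "per subtask" explicitly in the `maxNumFiles` row would remove the ambiguity.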



