You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/05 18:30:06 UTC

[GitHub] [druid] jihoonson opened a new pull request #10243: Add maxNumFiles to splitHintSpec

jihoonson opened a new pull request #10243:
URL: https://github.com/apache/druid/pull/10243


   ### Description
   
   This PR adds `maxNumFiles` to `maxSize` splitHintSpec which is a new limit on the max number of files in a split. Also, the human readable format is now supported for `maxSplitSize`.
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r472545452



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       Hmm, good question. I would say it could be more useful to have a way to apply different default configurations per datasource since the `maxSplitSize` should be adjusted based on the shape of input data and partitioning scheme of output data. But for this, I think it could be better to add a supervisor which periodically performs batch ingestion based on the user-provided configurations.
   
   Particularly regarding keeping the previous default, I'm not sure when it would be good to do. `maxSplitSize` is mostly for controlling the parallelism of the phase which reads data from inputSource in parallel indexing, but it also affects the number of segments created after the input-read phase. So, there is a trade-off between them. However, I would say increasing maxSplitSize 512 MB to 1 GB wouldn't change things dramatically. If you have a cluster where all subtasks split by the previous default can run at the same time but not with the new default, you might want to use the previous default because, in theory, it will give you 2 times better read performance. However, in practice, you would likely have more than one task to run at the same time, which the cluster resource should be shared across.
   
   What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r471863374



##########
File path: integration-tests/src/test/resources/indexer/wikipedia_parallel_ingest_segment_index_task.json
##########
@@ -61,7 +61,7 @@
       "forceGuaranteedRollup": "%%FORCE_GUARANTEED_ROLLUP%%",
       "splitHintSpec": {
         "type": "maxSize",
-        "maxSplitSize": 1

Review comment:
       :+1: Changed `wikipedia_parallel_index_task.json` back to use `maxSplitSize`.

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +44,53 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.
+   */
+  @VisibleForTesting
+  static final int DEFAULT_MAX_NUM_FILES = 1000;
+
+  private final HumanReadableBytes maxSplitSize;
+  private final int maxNumFiles;
 
   @JsonCreator
-  public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize)
+  public MaxSizeSplitHintSpec(
+      @JsonProperty("maxSplitSize") @Nullable HumanReadableBytes maxSplitSize,
+      @JsonProperty("maxNumFiles") @Nullable Integer maxNumFiles
+  )
   {
     this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize;
+    this.maxNumFiles = maxNumFiles == null ? DEFAULT_MAX_NUM_FILES : maxNumFiles;

Review comment:
       Good call. Added a check for both maxSplitsize and maxNumFiles.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). [Human-readable format](../configuration/human-readable-byte.md) is supported.|500MiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|1000|no|

Review comment:
       Thanks, they don't take precedence over each other, but split boundary will be determined when either of them hits. Added some more clarification.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on pull request #10243:
URL: https://github.com/apache/druid/pull/10243#issuecomment-675199406


   @suneet-s thanks for the review. I also changed the default of `maxSplitSize` to 1 GB. 500 MB seems too small based on my recent experience.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] suneet-s commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

suneet-s commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r472687669



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       Thinking about this more, I can't think of a good reason to have a cluster wide fallback option, other than my paranoia 😅  
   
   Your analysis makes sense to me, and while adding the ability to automatically adjust this based on the shape and partitioning scheme of the data sounds really cool, it's definitely beyond the scope of this change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] suneet-s commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

suneet-s commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r471944448



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       This change seems reasonable to me, but since we're changing the default behavior from 512 MiB to 1GiB, can we add a system property to override this default value. In case users are doing an upgrade to 0.20 and want the old behavior, the system property would give them a way. to have the previous default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] clintropolis commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

clintropolis commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473313243



##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|

Review comment:
       typo: 'Noe' -> 'Note'

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.

Review comment:
       is this a typo: 'consertively' -> 'conservatively'?

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxSplitSize` even if the total number of files is smaller than `maxNumFiles`.|1000|no|

Review comment:
       Does this limit apply to the entire parallel task, just the subtasks, or both? It isn't super clear from the docs here, though from my interpretation of the code it looks like this applies to subtasks?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473174787



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       > Your analysis makes sense to me, and while adding the ability to automatically adjust this based on the shape and partitioning scheme of the data sounds really cool, it's definitely beyond the scope of this change
   
   Oh, I originally meant a supervisor working based on fixed user configurations, but yeah auto adjusting would be nice.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r474171560



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is conservatively chosen as 1000.
+   */
+  @VisibleForTesting
+  static final int DEFAULT_MAX_NUM_FILES = 1000;
+
+  private final HumanReadableBytes maxSplitSize;
+  private final int maxNumFiles;
 
   @JsonCreator
-  public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize)
+  public MaxSizeSplitHintSpec(
+      @JsonProperty("maxSplitSize") @Nullable HumanReadableBytes maxSplitSize,
+      @JsonProperty("maxNumFiles") @Nullable Integer maxNumFiles
+  )
   {
     this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize;
+    this.maxNumFiles = maxNumFiles == null ? DEFAULT_MAX_NUM_FILES : maxNumFiles;
+    Preconditions.checkArgument(this.maxSplitSize.getBytes() > 0, "maxSplitSize should be larger than 0");
+    Preconditions.checkArgument(this.maxNumFiles > 0, "maxNumFiles should be larger than 0");
+  }
+
+  @VisibleForTesting
+  public MaxSizeSplitHintSpec(long maxSplitSize, @Nullable Integer maxNumFiles)
+  {
+    this(new HumanReadableBytes(maxSplitSize), maxNumFiles);
   }
 
   @JsonProperty
-  public long getMaxSplitSize()
+  public HumanReadableBytes getMaxSplitSize()

Review comment:
       Ah good idea. It actually needs the same change because it needs to be identical to `maxSize` splitHintSpec at least for now. We could implement a better split based on stats in the future.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] suneet-s commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

suneet-s commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r471944448



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       Since we're changing the default behavior from 512 MiB to 1GiB, can we add a system property to override this default value. In case users are doing an upgrade to 0.20 and want the old behavior, the system property would give them a way. to have the previous default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473691192



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.

Review comment:
       Oops, thanks.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|

Review comment:
       Thanks, fixed.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxSplitSize` even if the total number of files is smaller than `maxNumFiles`.|1000|no|

Review comment:
       Hmm, it is applied to only subtasks. I changed `single task` to `single subtask`. Is it better?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] clintropolis commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

clintropolis commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473797085



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is conservatively chosen as 1000.
+   */
+  @VisibleForTesting
+  static final int DEFAULT_MAX_NUM_FILES = 1000;
+
+  private final HumanReadableBytes maxSplitSize;
+  private final int maxNumFiles;
 
   @JsonCreator
-  public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize)
+  public MaxSizeSplitHintSpec(
+      @JsonProperty("maxSplitSize") @Nullable HumanReadableBytes maxSplitSize,
+      @JsonProperty("maxNumFiles") @Nullable Integer maxNumFiles
+  )
   {
     this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize;
+    this.maxNumFiles = maxNumFiles == null ? DEFAULT_MAX_NUM_FILES : maxNumFiles;
+    Preconditions.checkArgument(this.maxSplitSize.getBytes() > 0, "maxSplitSize should be larger than 0");
+    Preconditions.checkArgument(this.maxNumFiles > 0, "maxNumFiles should be larger than 0");
+  }
+
+  @VisibleForTesting
+  public MaxSizeSplitHintSpec(long maxSplitSize, @Nullable Integer maxNumFiles)
+  {
+    this(new HumanReadableBytes(maxSplitSize), maxNumFiles);
   }
 
   @JsonProperty
-  public long getMaxSplitSize()
+  public HumanReadableBytes getMaxSplitSize()

Review comment:
       Should `SegmentsSplitHintSpec` be updated to accept `HumanReadableBytes` for `maxInputSegmentBytesPerTask` as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r470734118



##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). [Human-readable format](../configuration/human-readable-byte.md) is supported.|500MiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|5000|no|

Review comment:
       ```suggestion
   |maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|1000|no|
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] suneet-s commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

suneet-s commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r470889198



##########
File path: integration-tests/src/test/resources/indexer/wikipedia_parallel_ingest_segment_index_task.json
##########
@@ -61,7 +61,7 @@
       "forceGuaranteedRollup": "%%FORCE_GUARANTEED_ROLLUP%%",
       "splitHintSpec": {
         "type": "maxSize",
-        "maxSplitSize": 1

Review comment:
       Should we keep one of these tests to use `maxSplitSize` so that we know that the old specs continue to work

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +44,53 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.
+   */

Review comment:
       This is a great comment! Thank you!

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). [Human-readable format](../configuration/human-readable-byte.md) is supported.|500MiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|1000|no|

Review comment:
       Is there a pattern to document the relationship between these 2 configs. 
   
   Reading these docs I'm not sure if I need to consider one when setting the other or which one takes precedence.
   

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +44,53 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.
+   */
+  @VisibleForTesting
+  static final int DEFAULT_MAX_NUM_FILES = 1000;
+
+  private final HumanReadableBytes maxSplitSize;
+  private final int maxNumFiles;
 
   @JsonCreator
-  public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize)
+  public MaxSizeSplitHintSpec(
+      @JsonProperty("maxSplitSize") @Nullable HumanReadableBytes maxSplitSize,
+      @JsonProperty("maxNumFiles") @Nullable Integer maxNumFiles
+  )
   {
     this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize;
+    this.maxNumFiles = maxNumFiles == null ? DEFAULT_MAX_NUM_FILES : maxNumFiles;

Review comment:
       Should we add a preconditionCheck on maxNumFiles? If the user has entered maxNumFiles, it should be a positive number >= 1
   
   I think what you've implemented works even with negative numbers, but maybe it's better to tell the user they're doing something strange.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson merged pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson merged pull request #10243:
URL: https://github.com/apache/druid/pull/10243


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10243: Add maxNumFiles to splitHintSpec

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r470774466



##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). [Human-readable format](../configuration/human-readable-byte.md) is supported.|500MiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them.|5000|no|

Review comment:
       Oops, nice catch! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org