Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/01/22 18:43:31 UTC

[GitHub] [druid] techdocsmith opened a new pull request #10788: suggest index parallel for native batch reindexing > 1GB

techdocsmith opened a new pull request #10788:
URL: https://github.com/apache/druid/pull/10788


   Removes outdated recommendation to use Hadoop for production.
   
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   
   cc: @petermarshallio @druid-matt
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] techdocsmith commented on a change in pull request #10788: suggest index parallel for native batch reindexing > 1GB

Posted by GitBox <gi...@apache.org>.

techdocsmith commented on a change in pull request #10788:
URL: https://github.com/apache/druid/pull/10788#discussion_r562873222



##########
File path: docs/ingestion/data-management.md
##########
@@ -232,11 +232,7 @@ There are other types of `inputSpec` to enable reindexing and delta ingestion.
 
 ### Reindexing with Native Batch Ingestion
 
-This section assumes the reader understands how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md),
-which uses an `inputSource` to know where and how to read the input data. The [`DruidInputSource`](native-batch.md#druid-input-source)
-can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as
-it has to do all processing inside a single process and can't scale. Please use Hadoop batch ingestion for production
-scenarios dealing with more than 1GB of data.
+This section assumes you understand how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md). Native batch indexing uses an `inputSource` to know where and how to read the input data. You can use the [`DruidInputSource`](native-batch.md#druid-input-source) to read data from segments inside Druid. Use the Index task (`index`) for prototyping purposes because it relies on a single process and can't scale. Use Parallel task (`index_parallel`) to ingest more than 1GB of data.

Review comment:
       Thanks @jihoonson. I changed it, PTAL.





[GitHub] [druid] jihoonson merged pull request #10788: suggest index parallel for native batch reindexing > 1GB

Posted by GitBox <gi...@apache.org>.

jihoonson merged pull request #10788:
URL: https://github.com/apache/druid/pull/10788


   



[GitHub] [druid] jihoonson commented on a change in pull request #10788: suggest index parallel for native batch reindexing > 1GB

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10788:
URL: https://github.com/apache/druid/pull/10788#discussion_r562859292



##########
File path: docs/ingestion/data-management.md
##########
@@ -232,11 +232,7 @@ There are other types of `inputSpec` to enable reindexing and delta ingestion.
 
 ### Reindexing with Native Batch Ingestion
 
-This section assumes the reader understands how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md),
-which uses an `inputSource` to know where and how to read the input data. The [`DruidInputSource`](native-batch.md#druid-input-source)
-can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as
-it has to do all processing inside a single process and can't scale. Please use Hadoop batch ingestion for production
-scenarios dealing with more than 1GB of data.
+This section assumes you understand how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md). Native batch indexing uses an `inputSource` to know where and how to read the input data. You can use the [`DruidInputSource`](native-batch.md#druid-input-source) to read data from segments inside Druid. Use the Index task (`index`) for prototyping purposes because it relies on a single process and can't scale. Use Parallel task (`index_parallel`) to ingest more than 1GB of data.

Review comment:
       `index_parallel` behaves almost the same as `index` when `maxNumConcurrentSubTasks` is 1. So I think we can suggest always using `index_parallel` and changing `maxNumConcurrentSubTasks` depending on data size.
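
For illustration only, the suggestion above could look roughly like the following. This is a minimal sketch, not text from the PR: the helper name `build_reindex_spec`, the one-subtask-per-GB heuristic, and the `max_subtasks` cap are all hypothetical, and the spec fields reflect a general understanding of the native batch task format, so check them against the docs for your Druid version.

```python
import json

GIB = 1024 ** 3  # bytes in one GiB


def build_reindex_spec(datasource, interval, data_size_bytes, max_subtasks=8):
    """Build an index_parallel reindexing task spec as a plain dict.

    Hypothetical heuristic (an assumption, not Druid guidance): use one
    subtask per GiB of input, at least 1, capped at max_subtasks. With a
    value of 1 the task behaves almost like the single-process index task.
    """
    subtasks = max(1, -(-data_size_bytes // GIB))  # ceiling division
    subtasks = min(subtasks, max_subtasks)
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Reading existing segments: the timestamp is already in __time.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": []},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # The druid input source reads rows back out of existing segments.
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
            },
            "tuningConfig": {
                "type": "index_parallel",
                "maxNumConcurrentSubTasks": subtasks,
            },
        },
    }


if __name__ == "__main__":
    # Example datasource name and interval are placeholders.
    spec = build_reindex_spec("wikipedia", "2020-01-01/2020-02-01", 5 * GIB)
    print(json.dumps(spec, indent=2))
```

The point of the sketch is that the task type stays `index_parallel` regardless of input size; only `maxNumConcurrentSubTasks` changes.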





[GitHub] [druid] jihoonson commented on pull request #10788: suggest index parallel for native batch reindexing > 1GB

Posted by GitBox <gi...@apache.org>.

jihoonson commented on pull request #10788:
URL: https://github.com/apache/druid/pull/10788#issuecomment-765873378


   The integration test failures should be unrelated to the doc change.

