Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/08/02 01:47:59 UTC

[GitHub] [incubator-druid] jihoonson commented on issue #8061: Native parallel batch indexing with shuffle

URL: https://github.com/apache/incubator-druid/issues/8061#issuecomment-517516557
 
 
   I think two big changes have been happening in Druid's indexing service recently. One is the native parallel indexing, and the other is the new Indexer module (https://github.com/apache/incubator-druid/issues/7900). I think the Indexer module would be better than the middleManager in terms of memory management, monitoring, and resource scheduling and sharing, and I hope it will be able to replace the middleManager in the future. It would be nice if we could keep the current behavior for users even with these changes, but I don't think that will be easy, especially with the Indexer.
   
   > data cleanup could be easy if we follow a hierarchy, e.g the baseTaskDir/supervisor-task-id of the supervisor task can serve as the base path for the intermediary location and MM can just ensure that the base path and any underlying sub-dirs are cleaned up when the supervisor task fails.
   
   Yeah, it would be pretty similar to using the MM as the intermediary data server with respect to the intermediary data structure and management. But I think it could be more complex when it comes to operational aspects like permissions or failure handling. Not only the middleManager but also the overlord should be able to remove stale intermediary data, because the middleManager could miss the supervisor task failure. The attempt to delete intermediary data could itself fail, and that would be easier to handle on a local disk. I mean, it's still doable with deep storage, but it would be a bit more complex.
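   
   To make the cleanup idea concrete, here is a minimal sketch of how a middleManager (or the overlord, acting as a backstop) could sweep stale intermediary directories. The `baseTaskDir/<supervisor-task-id>/...` layout, the class name, and the `isSupervisorTaskActive` check are assumptions for illustration, not Druid's actual API:
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.Comparator;
   import java.util.function.Predicate;
   import java.util.stream.Stream;
   
   /**
    * Hypothetical sketch: intermediary data is assumed to live under
    * baseTaskDir/<supervisor-task-id>/..., so cleaning up after a failed (or
    * missed) supervisor task is just deleting that subtree.
    */
   public class IntermediaryDataCleaner
   {
     private final Path baseTaskDir;
     private final Predicate<String> isSupervisorTaskActive; // e.g. backed by an overlord lookup
   
     public IntermediaryDataCleaner(Path baseTaskDir, Predicate<String> isSupervisorTaskActive)
     {
       this.baseTaskDir = baseTaskDir;
       this.isSupervisorTaskActive = isSupervisorTaskActive;
     }
   
     /** Deletes intermediary directories whose supervisor task is no longer active. */
     public void cleanupStaleDirs() throws IOException
     {
       try (Stream<Path> dirs = Files.list(baseTaskDir)) {
         dirs.filter(Files::isDirectory)
             .filter(dir -> !isSupervisorTaskActive.test(dir.getFileName().toString()))
             .forEach(this::deleteRecursively);
       }
     }
   
     private void deleteRecursively(Path dir)
     {
       try (Stream<Path> tree = Files.walk(dir)) {
         // Delete children before their parents.
         tree.sorted(Comparator.reverseOrder()).forEach(path -> {
           try {
             Files.deleteIfExists(path);
           }
           catch (IOException e) {
             // A failed delete is simply retried on the next sweep.
           }
         });
       }
       catch (IOException ignored) {
         // Retry on the next periodic sweep.
       }
     }
   }
   ```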
   
   > I see MM as very lightweight processes that have the responsibility of orchestration of peons, monitoring and cleaning any leftover files/data. It would be great if we can keep it that way.
   
   Hmm, the additional functionality would just be serving intermediary data and cleaning it up periodically. Maybe the middleManager would need a bit more memory, but the increase shouldn't be that big. Could you elaborate more on your concern?
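   
   As a rough illustration of how small that addition could be, here is a hypothetical wiring of the periodic sweep, reusing the `IntermediaryDataCleaner` sketch above; the period, paths, and running-task lookup are made up for the example:
   
   ```java
   import java.nio.file.Paths;
   import java.util.Set;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   
   /**
    * Hypothetical wiring of the periodic cleanup inside the middleManager.
    * The period and the running-task lookup below are assumptions, not Druid's
    * actual configuration.
    */
   public class IntermediaryDataCleanupScheduler
   {
     public static void main(String[] args)
     {
       // Stand-in for an overlord call returning the currently running supervisor tasks.
       Set<String> runningSupervisorTasks = Set.of("index_parallel_wikipedia_2019-08-02");
   
       IntermediaryDataCleaner cleaner = new IntermediaryDataCleaner(
           Paths.get("/tmp/druid/baseTaskDir"),
           runningSupervisorTasks::contains
       );
   
       ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
       exec.scheduleWithFixedDelay(
           () -> {
             try {
               cleaner.cleanupStaleDirs();
             }
             catch (Exception e) {
               // Log and retry on the next run; a missed sweep only delays disk reclamation.
             }
           },
           0,
           10,
           TimeUnit.MINUTES
       );
     }
   }
   ```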
   
   Looking at the code around `AutoScaler` and `Provisioner`, if the overlord can collect how much intermediary data remains on each middleManager, this information could be used for auto scaling. I guess it could be easily added to the existing code base since the overlord already collects some data from middleManagers (like host name, port, etc.). And since the `AutoScaler` and `Provisioner` are executed in the overlord process, I guess we can easily provide this information to the `Provisioner`. It looks like the implementation wouldn't be that difficult?
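   
   For example, here is a hypothetical sketch of the scale-down side of that idea. The types and method below are illustrative only, not Druid's actual `AutoScaler`/`Provisioner` interfaces; the point is just that a per-middleManager intermediary data report could keep busy workers from being terminated:
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   /**
    * Hypothetical sketch: if the overlord already tracks each middleManager's
    * host/port, it could also track how much intermediary data each one still
    * holds, and the provisioning logic could avoid terminating workers whose
    * shuffle data is still needed.
    */
   public class IntermediaryDataAwareScaleDown
   {
     /** intermediaryBytesByWorker: bytes of intermediary data still present on each worker, keyed by host:port. */
     public static List<String> pickTerminatableWorkers(
         Map<String, Long> intermediaryBytesByWorker,
         int numWorkersToTerminate
     )
     {
       return intermediaryBytesByWorker.entrySet()
           .stream()
           // Only workers with no remaining intermediary data are safe to terminate.
           .filter(entry -> entry.getValue() == 0L)
           .map(Map.Entry::getKey)
           .limit(numWorkersToTerminate)
           .collect(Collectors.toList());
     }
   
     public static void main(String[] args)
     {
       Map<String, Long> report = Map.of(
           "mm1.example.com:8091", 0L,
           "mm2.example.com:8091", 512L * 1024 * 1024, // still holds 512 MB of shuffle data
           "mm3.example.com:8091", 0L
       );
       // Ask to scale down by two workers; only the idle ones are returned.
       System.out.println(pickTerminatableWorkers(report, 2));
     }
   }
   ```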
   
   > Efficiency wise I agree that reading data from deep storage (especially when using s3) would be slower than reading it from another MM, but till now I have seen creation of Druid indexes and final segments to be the major bottleneck instead of data transfer.
   
   That sounds interesting. I haven't used Hadoop ingestion much, so I don't know how much data transfer contributes to the total ingestion time. Do you have any rough numbers on this?
