Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2021/03/12 00:11:42 UTC

[GitHub] [gobblin] sv2000 commented on pull request #3158: [GOBBLIN-1377] Adds functionality to create shards for target directories in hive distcp

sv2000 commented on pull request #3158:
URL: https://github.com/apache/gobblin/pull/3158#issuecomment-797141187


   @Will-Lo Thanks for the PR. After going through it, I have a different proposal for how dataset-specific logic could be injected. We could invoke one or more configurable handler classes inside the AbstractJobLauncher right after the workunit creation step. For your use case, you could rely on the DATASET_URN property, which is set in the different source classes, and provide a handler implementation that sets dataset-specific staging/output dirs. Please take a look at the TaskStateCollectorServiceHandler class as an example; it is invoked on task completion. What I am proposing here is the same pattern, but invoked before the job is launched.
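
To make the proposal concrete, here is a rough sketch of what such a pre-launch handler could look like. It is only an illustration: `WorkUnitHandler`, `DatasetDirHandler`, and `runHandlers` are hypothetical names, work units are modeled as plain maps rather than Gobblin's real WorkUnit class, and the property keys are assumed for the example. The actual hook in AbstractJobLauncher would need its own design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkUnitHandlerSketch {

    /** Hypothetical hook, not an existing Gobblin interface. */
    interface WorkUnitHandler {
        void handle(List<Map<String, String>> workUnits);
    }

    /** Hypothetical handler that gives each dataset its own staging/output dirs. */
    static class DatasetDirHandler implements WorkUnitHandler {
        // Property keys are assumptions made for this sketch.
        static final String DATASET_URN = "dataset.urn";
        static final String STAGING_DIR = "writer.staging.dir";
        static final String OUTPUT_DIR = "writer.output.dir";

        private final String stagingRoot;
        private final String outputRoot;

        DatasetDirHandler(String stagingRoot, String outputRoot) {
            this.stagingRoot = stagingRoot;
            this.outputRoot = outputRoot;
        }

        @Override
        public void handle(List<Map<String, String>> workUnits) {
            for (Map<String, String> workUnit : workUnits) {
                String urn = workUnit.get(DATASET_URN);
                if (urn == null) {
                    continue; // no dataset URN: leave this work unit untouched
                }
                // Turn the URN into a path-safe shard name.
                String shard = urn.replaceAll("[^A-Za-z0-9]+", "_");
                workUnit.put(STAGING_DIR, stagingRoot + "/" + shard);
                workUnit.put(OUTPUT_DIR, outputRoot + "/" + shard);
            }
        }
    }

    /** Stand-in for the launcher step that would run right after workunit creation. */
    static void runHandlers(List<WorkUnitHandler> handlers,
                            List<Map<String, String>> workUnits) {
        for (WorkUnitHandler handler : handlers) {
            handler.handle(workUnits);
        }
    }

    public static void main(String[] args) {
        Map<String, String> workUnit = new HashMap<>();
        workUnit.put(DatasetDirHandler.DATASET_URN, "db.table1");
        List<Map<String, String>> workUnits = new ArrayList<>();
        workUnits.add(workUnit);

        List<WorkUnitHandler> handlers = new ArrayList<>();
        handlers.add(new DatasetDirHandler("/tmp/staging", "/tmp/output"));
        runHandlers(handlers, workUnits);

        System.out.println(workUnit.get(DatasetDirHandler.STAGING_DIR));
    }
}
```

The point of the sketch is that the launcher only iterates over a configured list of handlers, so the sharding logic stays out of the distcp code path.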
   
   This high-level approach can also be reused for future use cases, e.g. where a Gobblin pipeline needs to write to a table (e.g. Iceberg): we could define a handler that ensures the table is created before the job is started.
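
As a rough illustration of the table use case, a handler along these lines could run before launch. Everything here is hypothetical: `TableCatalog`, `InMemoryCatalog`, and `EnsureTableHandler` are made-up names, and the in-memory catalog merely stands in for a real Iceberg (or other) catalog client.

```java
import java.util.HashSet;
import java.util.Set;

public class EnsureTableSketch {

    /** Hypothetical minimal catalog abstraction; a real one would call the table service. */
    interface TableCatalog {
        boolean tableExists(String name);
        void createTable(String name);
    }

    /** In-memory stand-in so the sketch runs without any external system. */
    static class InMemoryCatalog implements TableCatalog {
        private final Set<String> tables = new HashSet<>();

        @Override
        public boolean tableExists(String name) {
            return tables.contains(name);
        }

        @Override
        public void createTable(String name) {
            tables.add(name);
        }
    }

    /** Hypothetical pre-launch handler: creates the target table only if it is missing. */
    static class EnsureTableHandler {
        private final TableCatalog catalog;

        EnsureTableHandler(TableCatalog catalog) {
            this.catalog = catalog;
        }

        /** Returns true if the table had to be created, false if it already existed. */
        boolean ensure(String tableName) {
            if (catalog.tableExists(tableName)) {
                return false;
            }
            catalog.createTable(tableName);
            return true;
        }
    }

    public static void main(String[] args) {
        EnsureTableHandler handler = new EnsureTableHandler(new InMemoryCatalog());
        System.out.println(handler.ensure("db.events")); // first call creates the table
        System.out.println(handler.ensure("db.events")); // second call finds it present
    }
}
```

Because the check is idempotent, the handler would be safe to run on every job launch.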
   
   Happy to discuss offline if you have questions. 

