Posted to user@gobblin.apache.org by Abhishek Tiwari <ab...@apache.org> on 2017/12/13 12:11:10 UTC

Re: Gobblin As Service Questions

Hi Vicky,

Sorry for not replying to this earlier. SpecExecutorInstance started
off as a representation of the Executor; however, it has now evolved into
more of a 'communication mechanism', so the KafkaSpecExecutor can be used
by any executor that supports communication over Kafka (which currently
includes the standalone cluster and, by virtue of it, the Yarn and AWS
cluster modes).
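
A minimal sketch of the 'communication mechanism' idea, not Gobblin's actual
code: the names InMemoryTopic, SpecProducer and ClusterExecutor are
hypothetical, and an in-memory queue stands in for a Kafka topic. The service
side only publishes a job spec to the channel; any executor that can read from
that channel picks it up, whatever its deployment mode.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic (not a real Kafka client)."""
    messages: deque = field(default_factory=deque)

    def publish(self, message: dict) -> None:
        self.messages.append(message)

    def poll(self):
        # Return the oldest unread message, or None when the topic is empty.
        return self.messages.popleft() if self.messages else None


class SpecProducer:
    """Hypothetical service-side half: it knows the channel, not the executor."""

    def __init__(self, topic: InMemoryTopic) -> None:
        self.topic = topic

    def add_spec(self, job_spec: dict) -> None:
        self.topic.publish(job_spec)


class ClusterExecutor:
    """Hypothetical executor-side half; 'mode' is only a label here."""

    def __init__(self, mode: str, topic: InMemoryTopic) -> None:
        self.mode = mode
        self.topic = topic
        self.received = []

    def consume_one(self) -> bool:
        # Pull at most one spec from the shared channel.
        spec = self.topic.poll()
        if spec is None:
            return False
        self.received.append(spec)
        return True


topic = InMemoryTopic()
producer = SpecProducer(topic)
executor = ClusterExecutor("standalone", topic)
producer.add_spec({"flowName": "myflow1", "flowGroup": "mygroup"})
executor.consume_one()
```

In this sketch, swapping "standalone" for another mode changes nothing on the
producer side, which is the point of treating the SpecExecutor as a channel
rather than as the executor itself.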

Regards,
Abhishek

On Fri, Nov 24, 2017 at 1:33 AM, Vicky Kak <vi...@gmail.com> wrote:

> Hi Abhishek,
>
> I have started looking into the GAAS again.
>
> It seems that the standalone mode does not have the SpecExecutorInstance
> configured.
>
> As per the wiki, the diagram there indicates that the SpecExecutorInstance
> is for the Standalone Cluster and Gobblin on Azkaban (which you had mentioned
> in your last email too).
> I don't see it for the YARN/MR mode either, and I am not able to see it for
> AWS.
>
> Is there any specific reason for dropping the implementation for these
> modes?
>
>
> Thanks,
> Vicky
>
> On Fri, Sep 8, 2017 at 5:47 AM, Abhishek Tiwari <ab...@apache.org> wrote:
>
>> Response inlined in red.
>>
>> On Fri, Jul 28, 2017 at 5:23 AM, Vicky Kak <vi...@gmail.com> wrote:
>>
>>> Hi Abhishek,
>>>
>>> Some of the review points after going through the wiki
>>>
>>> 1) There is no component available by the name of "FlowManager", it
>>> seems the FlowManager is basically the FlowConfigsResource+RestLi handling
>>> the user invocation.
>>>
>> Yes, that's correct.
>>
>>>
>>> 2) There is no explicit mention of the triggering of the existing Flow;
>>> it seems to be triggered via the POST call mentioned in the documentation
>>> as
>>>
>>> curli http://localhost:8080/flowconfigs -X POST -H 'X-RestLi-Method:
>>> create' -H 'X-RestLi-Protocol-Version: 2.0.0' --data '{"flowName" :
>>> "myflow1", "flowGroup" : "mygroup", "templateNames" :
>>> "FS:///mytemplate.template", "schedule" : "", "properties" : {"prop1" :
>>> "value1"}}'
>>>
>> The Flow is run on a schedule, with the possibility of runImmediately. I
>> think it will help if I can come up with a runbook :)
>>
>>>
>>> 3) You can see the typo in the wiki in 2, check the curli part.
>>>
>>> 4) I am not able to find the monitoring-related code in the
>>> GobblinServiceManager. Where is the monitoring piece?
>>>
>> I think the Monitoring piece was never open sourced. The person who wrote
>> it left the org; I will try to operationalize it and push it to open source.
>> It's lower on my todo list though, so it might take time.
>>
>>>
>>> 5) The Appendix section contains references to Components which
>>> do not seem to be present, such as SimpleRESTSpecExecutor and
>>> OrchestratorModule (the module name should be removed), and possibly many
>>> more. Also, I am not able to find GobblinRestFlowMonitor etc. I have got
>>> build errors in Eclipse; maybe that is the reason I am not able to see
>>> these classes.
>>>
>> The SpecExecutors are still being added. Other than the Kafka-based one, I
>> just added the Azkaban one, which is basically a REST-based SpecExecutor. A
>> few class names have changed too (not a great thing, but everything is still
>> evolving :) )
>>
>>>
>>> Also, I see GAAS sending the Jobs to the SpecExecutorInstance via
>>> Kafka/git etc.; however, I have not yet been able to find how the
>>> SpecExecutorInstance is configured in the Gobblin instances where the Jobs
>>> should be constructed and triggered. How and where do we configure the
>>> SpecExecutorInstance for the Gobblin instances for which the Jobs can be
>>> configured/triggered via GAAS?
>>>
>> Look for StreamingJobConfigurationManager
>>
>>>
>>>
>>
>>> Thanks,
>>> Vicky
>>>
>>> On Fri, Jul 28, 2017 at 9:07 AM, Vicky Kak <vi...@gmail.com> wrote:
>>>
>>>> I can see the images now.
>>>>
>>>> Thanks,
>>>> Vicky
>>>>
>>>> On Fri, Jul 28, 2017 at 9:05 AM, Abhishek Tiwari <
>>>> abhishektiwari.btech@gmail.com> wrote:
>>>>
>>>>> Hi Vicky,
>>>>>
>>>>> I have fixed the images, please check again.
>>>>>
>>>>> Regards,
>>>>> Abhishek
>>>>>
>>>>>
>>>>> On Thu, Jul 27, 2017 at 8:20 PM, Vicky Kak <vi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Abhishek for the confirmation.
>>>>>>
>>>>>> I am not able to see the images in the GAAS wiki. The images seem to
>>>>>> be coming from Google Docs, and I could make out that my id does not
>>>>>> have access. Maybe making the images public would help; can you please
>>>>>> check why I am not able to see the images in the wiki?
>>>>>>
>>>>>> Regards,
>>>>>> Vicky
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 27, 2017 at 7:41 PM, Abhishek Tiwari <ab...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vicky,
>>>>>>>
>>>>>>> My responses are inlined in blue. You are on the right track.
>>>>>>>
>>>>>>> Also the design doc of Gobblin as a Service for your reference:
>>>>>>> https://cwiki.apache.org/confluence/display/GOBBLIN/Gobblin+as+a+Service
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhishek
>>>>>>>
>>>>>>> On Wed, Jul 26, 2017 at 5:45 AM, Vicky Kak <vi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I did spend more time looking at the code details and have the
>>>>>>>> following to share.
>>>>>>>>
>>>>>>>> I do see that GobblinServiceManager (the bootstrap class for
>>>>>>>> the gobblin service) performs these steps:
>>>>>>>> 1) Initialising the TopologyCatalog, FlowCatalog, Helix,
>>>>>>>> ServiceScheduler, EmbeddedLiServer and finally
>>>>>>>> Orchestrator/TopologySpecFactory.
>>>>>>>> 2) The FlowConfigClient seems to create the FlowConfig, then the
>>>>>>>> FlowSpec via FlowConfigResource (via RestEndpoint).
>>>>>>>> 3) The JobSpec gets added to the FlowCatalog, after which the
>>>>>>>> Orchestrator pushes the JobSpec to Kafka via
>>>>>>>> SimpleKafkaStepExecutionProducer.
>>>>>>>>
>>>>>>>> I have been looking for code that uses the
>>>>>>>> SimpleKafkaStepExecutionConsumer, but could not find how it is
>>>>>>>> hooked up with the running instance of Gobblin.
>>>>>>>>
>>>>>>> Look at gobblin-cluster and default config for classes being loaded
>>>>>>> for listeners, JobConfigurationManager, etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Here is how the gobblin service will invoke the Jobs on the slaves
>>>>>>>> (gobblin instances):
>>>>>>>>
>>>>>>>> 1) We should have the rest endpoint information so that we can send
>>>>>>>> the JobSpec via FlowConfigClient or via an HTTP GET (rest call; I have
>>>>>>>> not yet tried this). I don't see a way to get the port when the rest
>>>>>>>> server is started.
>>>>>>>>
>>>>>>> We should make it configurable; right now it chooses a random port.
>>>>>>>
>>>>>>>
>>>>>>>> 2) The JobSpec is passed to Kafka via the
>>>>>>>> SimpleKafkaStepExecutionProducer from the gobblin service via the
>>>>>>>> Orchestrator.
>>>>>>>> 3) There could be multiple instances of Gobblin listening to Kafka
>>>>>>>> using the SimpleKafkaStepExecutionConsumer; all the Gobblin instances
>>>>>>>> should get the JobSpecs. The one instance which matches the job specs
>>>>>>>> should trigger the Job.
>>>>>>>>
>>>>>>> Yes, we can make this a bit less ambiguous though.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The Gobblin service acts as a master and provides the rest endpoint
>>>>>>>> to read/create the JobSpecs, which will get triggered on the slaves
>>>>>>>> (which are the Gobblin instances).
>>>>>>>> I have not yet been able to run the flow, since I am getting some
>>>>>>>> build issues when building gobblin from master; the tests are
>>>>>>>> failing right now.
>>>>>>>>
>>>>>>>> Can someone from the development team validate whether I am on the
>>>>>>>> right track in terms of understanding the implementation and flows?
>>>>>>>>
>>>>>>> You are on the right track.
>>>>>>>
>>>>>>>>
>>>>>>>> I have got more questions which I will post after I confirm that I
>>>>>>>> am not missing anything.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vicky
>>>>>>>>
>>>>>>>> On Tue, Jul 25, 2017 at 5:03 PM, Vicky Kak <vi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> To my surprise, after I looked at the code and referred to the
>>>>>>>>> presentation that Shrishanka had sent, my ignorance about Gobblin
>>>>>>>>> as a Service was removed.
>>>>>>>>>
>>>>>>>>> Gobblin as a Service: it is a Global Orchestrator which helps in
>>>>>>>>> submitting the logical flow specifications, which are further
>>>>>>>>> compiled to the physical pipelines.
>>>>>>>>>
>>>>>>>>> We have been triggering the Gobblin Jobs using a REST endpoint,
>>>>>>>>> and it is done by implementing a custom service as explained here:
>>>>>>>>> https://groups.google.com/forum/#!topic/gobblin-users/kHrWh6lfGJM
>>>>>>>>>
>>>>>>>>> I have got the following questions:
>>>>>>>>>
>>>>>>>>> 1) What is the use case for Gobblin as a Service? I don't see the
>>>>>>>>> Orchestrator's rest endpoint port being configurable. If we have to
>>>>>>>>> add a FlowSpec from a different machine, we need to know the
>>>>>>>>> Orchestrator's host and port details; how do we do it?
>>>>>>>>>
>>>>>>> We use the d2 registry internally for it (if you don't already know
>>>>>>> about it, search for RESTLI D2).
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> 2) Does FlowSpec creation create a new Job deployment, which can
>>>>>>>>> also be done by copying the corresponding .pull or .job file into
>>>>>>>>> the gobblin distribution?
>>>>>>>>>
>>>>>>> If you are asking whether bundling a pull file in the gobblin
>>>>>>> distribution and creating the same via FlowSpec would mean the same
>>>>>>> thing, then yes. Otherwise, I didn't understand the question.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> 3) Since the master.out log gets created when starting a service,
>>>>>>>>> I assume there could be a way to add more Orchestrators to the master that
>>>>>>>>> is started. However I am not sure how to do that, can this be clarified?
>>>>>>>>>
>>>>>>> Only one node acts as orchestrator and scheduler. The rest of the
>>>>>>> nodes receive requests and pass them to the master, via Helix
>>>>>>> messages, for scheduling and orchestrating.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Please note that I have been looking at the older code; the git
>>>>>>>>> log is as follows.
>>>>>>>>> ***********************************************************************************************
>>>>>>>>> commit 755da9160cd91ea5ebcc752603ce1bffb74a75a1 (HEAD -> master, origin/master, origin/HEAD)
>>>>>>>>> Author: Kuai Yu <yu...@gmail.com>
>>>>>>>>> Date:   Tue Apr 11 19:10:53 2017 -0700
>>>>>>>>> ***********************************************************************************************
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Vicky
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>