Posted to user@gobblin.apache.org by Abhishek Tiwari <ab...@apache.org> on 2018/04/02 23:51:21 UTC

Re: Gobblin Clustering for Streaming.

Hi Vicky,

This is a good summary. While more people add their stories, why don't you
start a wiki page with these details?
I will also chime in with the different use cases we use Gobblin for once I
clear a bit of my backlog.

Regards
Abhishek

On Wed, Mar 28, 2018 at 3:29 AM, Vicky Kak <vi...@gmail.com> wrote:

> Hi Guys,
>
> I am in the process of using the Gobblin cluster to address the streaming
> use case; I have yet to look at the code. However, I would like to validate
> my understanding and design approaches based on the quantum of data to be
> ingested via
> Gobblin. Following is how I would classify the Gobblin solutions based on
> the quantum of data:
>
> Quantum of Data:
>
> - Small/Medium - This can be processed using the standalone mode (see the
> sketch after this list).
> - Large - This can be processed using the MR or YARN mode.
> - Unbounded (stream) - Gobblin cluster.
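>
> For the small/medium bounded case, launching in standalone mode could be as
> simple as an in-process run. A rough, untested sketch, assuming
> gobblin-runtime's EmbeddedGobblin helper (the job name and source class
> below are placeholders):
>
>   import org.apache.gobblin.runtime.api.JobExecutionResult;
>   import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;
>
>   public class StandaloneLaunch {
>     public static void main(String[] args) throws Exception {
>       // Runs a single job in-process; suits small/medium bounded data.
>       JobExecutionResult result = new EmbeddedGobblin("demo-job")
>           .setConfiguration("source.class", "com.example.MySource") // placeholder
>           .run();
>       System.out.println("succeeded: " + result.isSuccessful());
>     }
>   }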
>
> Bounded data (small/medium/large) to be ingested using Gobblin may have
> metadata which helps the Source partition the data and create the
> WorkUnits. However, in some cases we don't have metadata for the source
> data, so partitioning the data upfront is not possible. In the latter case
> we may iterate over the entire source and create the partitions while doing
> so; this is discussed here:
> https://mail-archives.apache.org/mod_mbox/incubator-gobblin-user/201709.mbox/browser
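>
> For the metadata-driven case, I picture the Source's getWorkunits() doing
> the partitioning. A minimal sketch, assuming Gobblin's Source/WorkUnit API
> (the my.source.* property names are made up):
>
>   import java.util.ArrayList;
>   import java.util.List;
>
>   import org.apache.gobblin.configuration.SourceState;
>   import org.apache.gobblin.configuration.WorkUnitState;
>   import org.apache.gobblin.source.Source;
>   import org.apache.gobblin.source.extractor.Extractor;
>   import org.apache.gobblin.source.workunit.Extract;
>   import org.apache.gobblin.source.workunit.WorkUnit;
>
>   public class PartitioningSource implements Source<String, String> {
>     @Override
>     public List<WorkUnit> getWorkunits(SourceState state) {
>       // Partition count discovered from source metadata (made-up key).
>       int numPartitions = state.getPropAsInt("my.source.num.partitions", 4);
>       Extract extract =
>           state.createExtract(Extract.TableType.APPEND_ONLY, "ns", "table");
>       List<WorkUnit> workUnits = new ArrayList<>();
>       for (int i = 0; i < numPartitions; i++) {
>         WorkUnit workUnit = WorkUnit.create(extract);
>         workUnit.setProp("my.source.partition.id", i);
>         workUnits.add(workUnit);
>       }
>       return workUnits;
>     }
>
>     @Override
>     public Extractor<String, String> getExtractor(WorkUnitState state) {
>       // Would return an Extractor that reads only this WorkUnit's partition.
>       throw new UnsupportedOperationException("sketch only");
>     }
>
>     @Override
>     public void shutdown(SourceState state) {}
>   }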
>
> For the case of unbounded data we need to address the following:
> - Gobblin nodes which will process the partitioned data.
> - The data should be partitioned so that it can be processed quickly and
> continuously.
> - Fault tolerance.
> - Scaling of the Gobblin processing nodes.
>
> Currently it seems that the unbounded data use case is handled via the
> Gobblin cluster and Kafka. Here is how it is addressed as of now:
> - The unbounded data is pushed to Kafka, which partitions the data.
> - The Source implementation can create the WorkUnits which read the data
> per Kafka partition (a sketch follows below).
> - Starting the job creates the WorkUnits based on the existing partitions.
> For each partition there is a Task pushed to the distributed task queue
> based on Helix.
> - The Gobblin cluster is based on a master/worker architecture and uses the
> Helix Task Framework under the hood.
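>
> A rough sketch of the per-partition WorkUnit creation I have in mind,
> assuming the plain Kafka consumer API for partition discovery (the class
> and the kafka.* property names are illustrative, not Gobblin's actual
> Kafka source):
>
>   import java.util.ArrayList;
>   import java.util.List;
>   import java.util.Properties;
>
>   import org.apache.gobblin.configuration.SourceState;
>   import org.apache.gobblin.source.workunit.Extract;
>   import org.apache.gobblin.source.workunit.WorkUnit;
>   import org.apache.kafka.clients.consumer.KafkaConsumer;
>   import org.apache.kafka.common.PartitionInfo;
>
>   public class KafkaPartitionWorkUnits {
>     public static List<WorkUnit> forTopic(SourceState state, String topic,
>         String brokers) {
>       Properties props = new Properties();
>       props.put("bootstrap.servers", brokers);
>       props.put("key.deserializer",
>           "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>       props.put("value.deserializer",
>           "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>
>       List<WorkUnit> workUnits = new ArrayList<>();
>       try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>         Extract extract =
>             state.createExtract(Extract.TableType.APPEND_ONLY, "kafka", topic);
>         // One WorkUnit per partition; Helix then schedules one task each.
>         for (PartitionInfo partition : consumer.partitionsFor(topic)) {
>           WorkUnit workUnit = WorkUnit.create(extract);
>           workUnit.setProp("kafka.topic", topic);
>           workUnit.setProp("kafka.partition.id", partition.partition());
>           workUnits.add(workUnit);
>         }
>       }
>       return workUnits;
>     }
>   }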
>
> I would like to hear more about the usage patterns from the community and
> the developer team so that I can consolidate the information and post it to
> the wiki for the use of others too.
>
> Thanks,
> Vicky
>