Posted to user@gobblin.apache.org by Dominique De Vito <dd...@gmail.com> on 2018/01/24 15:35:19 UTC

few short questions to better understand Gobblin scope

Hi,

I have dug a bit into the Gobblin web site. Below are a few questions
based on how I understand Gobblin so far:


1) Gobblin seems to use only configuration files => no GUI so far ?


2) Gobblin seems to support the workflow concept as follows:

-- Gobblin = <source> => <WorkUnit_1> =>  .....<WorkUnit_N> => <target>


One picture (in the "Architecture" page) shows this WorkUnit line (inside a
task) :

                extractor => converter => quality checker => fork operator

Is there an order to respect (one kind of WorkUnit, say "converter"
__strictly__ after another kind of WorkUnit, say "quality checker") ?

Or has this order only been crafted for presentation (in the
"Architecture" page), with no strict order between WorkUnit kinds
(for example, one may imagine a converter after the quality checker, and not
only before it as shown just above) ?

Thanks

Regards,
Dominique

Re: few short questions to better understand Gobblin scope

Posted by Joel Baranick <jb...@apache.org>.

One other clarification:

extractor => converter1 => converter2 => ... => converterN => qualityChecker
    => fork1 => converter4 => ... => converterN => writer
    |
    => fork2 => converter5 => ... => converterN => writer
    |
    => ...
    |
    => forkN => converter6 => ... => converterN => writer
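
The shape of the diagram above can be sketched in a few lines of Python. This is only an illustration of the flow, not Gobblin's Java API: the names chain and run_task, the dict-based records, and the rule that every record goes to every fork branch are all assumptions made for the example.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

Record = dict
Converter = Callable[[Record], Record]


def chain(converters: List[Converter]) -> Converter:
    """Compose converters left to right: converter1 => converter2 => ..."""
    return lambda record: reduce(lambda acc, conv: conv(acc), converters, record)


def run_task(
    records: Iterable[Record],
    pre_fork: List[Converter],
    quality_check: Callable[[Record], bool],
    branches: Dict[str, List[Converter]],
) -> Dict[str, List[Record]]:
    """Hypothetical task: shared converters, one quality check, then N branches."""
    pre = chain(pre_fork)
    outputs: Dict[str, List[Record]] = {name: [] for name in branches}
    for record in records:
        converted = pre(record)
        if not quality_check(converted):
            continue  # assumption: a failed check drops the record
        for name, branch_converters in branches.items():
            # Each branch works on its own copy and has its own "writer" (a list).
            outputs[name].append(chain(branch_converters)(dict(converted)))
    return outputs


outputs = run_task(
    records=[{"n": 1}, {"n": -2}, {"n": 3}],
    pre_fork=[lambda r: {**r, "n": r["n"] * 10}],
    quality_check=lambda r: r["n"] > 0,
    branches={
        "fork1": [lambda r: {**r, "sink": "a"}],
        "fork2": [lambda r: {"n": r["n"] + 1}, lambda r: {**r, "sink": "b"}],
    },
)
```

Here the converters before the quality checker run once per record, while each fork branch applies its own converter chain before writing.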


As far as speed goes, we run jobs every minute and the data usually makes
it to S3 within a minute of the job starting.

Joel Baranick
jbaranick@apache.org


On Wed, Jan 24, 2018 at 3:54 PM, Dominique De Vito <dd...@gmail.com>
wrote:

> Thanks Abhishek
>
>
> >Slight correction:
>
> >Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
>
> >i.e Workunits are independent of each other and division of overall work.
> They are not steps of the process.
>
> >Each workunit executes the following steps:
>
> >extractor => converter => quality checker => fork operator => writer
>
> ok, so I understand:
>
> Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
> like the following:
>
> Gobblin = <source> => <WorkUnit_1>   => <target> in parallel with
> + <source> => <WorkUnit_2>, .... => <target> in parallel with
> + ....
>
> ===> please, correct me if I am wrong.
>
> IMHO short path for ingestion (like "<source> => <WorkUnit>   =>
> <target>") makes more sense.
>
> Faster ingestion (and then, shorter path) makes more sense, because if
> data are available ASAP in (let's say) HDFS, then faster (because parallel)
> treatment could happen next (in for example Spark)
>
> Nice fit with Gobblin (AFAIU).
>
> Thanks.
>
> Dominique
>

Re: few short questions to better understand Gobblin scope

Posted by Abhishek Tiwari <ab...@apache.org>.

Yes, that's correct.

One thing to add to this: the source runs in the driver process, i.e. it is
used to query metadata of sorts to determine the workunits. Thereafter, the
workunits run in parallel. This can align well with Spark.
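
As a rough illustration of that split (planning in the driver, then parallel execution), here is a minimal Python sketch. It does not use Gobblin's API: get_work_units, run_work_unit, the in-memory partitions dict, and the thread pool are assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def get_work_units(partitions: Dict[str, List[int]], unit_size: int) -> List[dict]:
    """Driver side: use only metadata (names and sizes) to plan the work."""
    units = []
    for name, rows in partitions.items():
        for start in range(0, len(rows), unit_size):
            end = min(start + unit_size, len(rows))
            units.append({"partition": name, "start": start, "end": end})
    return units


def run_work_unit(unit: dict, partitions: Dict[str, List[int]]) -> int:
    """Worker side: process one independent slice (here, just sum it)."""
    rows = partitions[unit["partition"]][unit["start"]:unit["end"]]
    return sum(rows)


# Hypothetical source data: two partitions of different sizes.
partitions = {"p0": [1, 2, 3, 4, 5], "p1": [10, 20]}

# Step 1: the "source" plans work units in a single process.
units = get_work_units(partitions, unit_size=2)

# Step 2: the work units run independently and in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda u: run_work_unit(u, partitions), units))
```

No work unit depends on the result of another, which is what allows them to be scheduled in parallel.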

Regards,
Abhishek


On Wed, Jan 24, 2018 at 3:54 PM, Dominique De Vito <dd...@gmail.com>
wrote:

> Thanks Abhishek
>
>
> >Slight correction:
>
> >Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
>
> >i.e Workunits are independent of each other and division of overall work.
> They are not steps of the process.
>
> >Each workunit executes the following steps:
>
> >extractor => converter => quality checker => fork operator => writer
>
> ok, so I understand:
>
> Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
> like the following:
>
> Gobblin = <source> => <WorkUnit_1>   => <target> in parallel with
> + <source> => <WorkUnit_2>, .... => <target> in parallel with
> + ....
>
> ===> please, correct me if I am wrong.
>
> IMHO short path for ingestion (like "<source> => <WorkUnit>   =>
> <target>") makes more sense.
>
> Faster ingestion (and then, shorter path) makes more sense, because if
> data are available ASAP in (let's say) HDFS, then faster (because parallel)
> treatment could happen next (in for example Spark)
>
> Nice fit with Gobblin (AFAIU).
>
> Thanks.
>
> Dominique

Re: few short questions to better understand Gobblin scope

Posted by Dominique De Vito <dd...@gmail.com>.

Thanks Abhishek


>Slight correction:

>Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>


>i.e Workunits are independent of each other and division of overall work.
They are not steps of the process.

>Each workunit executes the following steps:

>extractor => converter => quality checker => fork operator => writer

ok, so I understand:

Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>

like the following:

Gobblin = <source> => <WorkUnit_1>   => <target> in parallel with
+ <source> => <WorkUnit_2>, .... => <target> in parallel with
+ ....

===> please, correct me if I am wrong.

IMHO a short path for ingestion (like "<source> => <WorkUnit> => <target>")
makes more sense.

Faster ingestion (and hence a shorter path) makes more sense, because if data
is available ASAP in (let's say) HDFS, then faster (because parallel)
processing can happen next (for example, in Spark).

Nice fit with Gobblin (AFAIU).

Thanks.

Dominique



2018-01-24 23:48 GMT+01:00 Abhishek Tiwari <ab...@apache.org>:

> Hi Dominique,
>
> Please find my answers inline.
>
> On Wed, Jan 24, 2018 at 7:35 AM, Dominique De Vito <dd...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have dug a bit into the Gobblin web site. Below are a few
>> questions based on how I understand Gobblin so far:
>>
>>
>> 1) Gobblin seems to use only configuration files => no GUI so far ?
>>
> We have a UI (gobblin-admin), but it does not let you configure jobs;
> it only lets you view running jobs and their status / history.
>
>>
>> 2) Gobblin seems to support the workflow concept as the following:
>>
>> -- Gobblin = <source> => <WorkUnit_1> =>  .....<WorkUnit_N> => <target>
>>
>>
>> One picture (in the "Architecture" page) shows this WorkUnit line (inside
>> a task) :
>>
>>                 extractor => converter => quality checker => fork
>> operator
>>
>> Is there an order to respect (one kind of WorkUnit, say "converter"
>> __strictly__ after another kind of WorkUnit, say "quality checker") ?
>>
>> Or, this order has been only crafted for presentation (in the
>> "Architecture" page), and there is no strict order between WorkUnit kinds
>> (for example, one may imagine a converter after quality checker, and not
>> only before like just above) ?
>>
>>
> Slight correction:
> Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
> i.e. workunits are independent of each other and are divisions of the overall
> work. They are not steps of the process.
> Each workunit executes the following steps:
> extractor => converter => quality checker => fork operator => writer
>
>
>
>> Thanks
>>
>> Regards,
>> Dominique
>>
>>
>>
>>
>

Re: few short questions to better understand Gobblin scope

Posted by Abhishek Tiwari <ab...@apache.org>.

Hi Dominique,

Please find my answers inline.

On Wed, Jan 24, 2018 at 7:35 AM, Dominique De Vito <dd...@gmail.com>
wrote:

> Hi,
>
> I have dug a bit into the Gobblin web site. Below are a few
> questions based on how I understand Gobblin so far:
>
>
> 1) Gobblin seems to use only configuration files => no GUI so far ?
>
We have a UI (gobblin-admin), but it does not let you configure jobs; it only
lets you view running jobs and their status / history.

>
> 2) Gobblin seems to support the workflow concept as the following:
>
> -- Gobblin = <source> => <WorkUnit_1> =>  .....<WorkUnit_N> => <target>
>
>
> One picture (in the "Architecture" page) shows this WorkUnit line (inside
> a task) :
>
>                 extractor => converter => quality checker => fork operator
>
> Is there an order to respect (one kind of WorkUnit, say "converter"
> __strictly__ after another kind of WorkUnit, say "quality checker") ?
>
> Or, this order has been only crafted for presentation (in the
> "Architecture" page), and there is no strict order between WorkUnit kinds
> (for example, one may imagine a converter after quality checker, and not
> only before like just above) ?
>
>
Slight correction:
Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>

i.e. workunits are independent of each other and are divisions of the overall
work. They are not steps of the process.
Each workunit executes the following steps:
extractor => converter => quality checker => fork operator => writer
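
To make the fixed step order concrete, here is a minimal Python sketch of what one workunit does. It is a conceptual model only, not Gobblin's Java API: the function name run_work_unit, the dict-based records, the lists standing in for writers, and the choice to drop records that fail the quality check are all assumptions.

```python
from typing import Callable, Iterable, List, Optional

Record = dict


def run_work_unit(
    extract: Callable[[], Iterable[Record]],
    convert: Callable[[Record], Optional[Record]],
    passes_quality_check: Callable[[Record], bool],
    fork: Callable[[Record], List[int]],
    writers: List[List[Record]],
) -> None:
    """Run the steps of one workunit in order, record by record."""
    for record in extract():                    # extractor
        converted = convert(record)             # converter
        if converted is None:
            continue                            # converter filtered the record out
        if not passes_quality_check(converted):  # quality checker
            continue                            # assumption: failed records are dropped
        for branch in fork(converted):          # fork operator picks the branches
            writers[branch].append(converted)   # writer (a list stands in for a sink)


# Two hypothetical sinks; records with id >= 3 are also copied to the second one.
sinks: List[List[Record]] = [[], []]
run_work_unit(
    extract=lambda: [{"id": 1, "v": "a"}, {"id": 2, "v": ""}, {"id": 3, "v": "c"}],
    convert=lambda r: {**r, "v": r["v"].upper()},
    passes_quality_check=lambda r: r["v"] != "",
    fork=lambda r: [0] if r["id"] < 3 else [0, 1],
    writers=sinks,
)
```

The order of the calls inside the loop is the point of the sketch: the quality check always sees converted records, and the fork happens only after the check.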



> Thanks
>
> Regards,
> Dominique
>
>
>
>