Posted to dev@gobblin.apache.org by Abhishek Tiwari <ab...@apache.org> on 2018/03/23 18:19:55 UTC

S3 to Hive

(moved conversation from Gitter)

> Tilak Patidar @tilakpatidar 02:21
> Hi all, I have a use case for which I wanted a general idea of how to
> solve it using Gobblin.
> We are periodically getting data from our client in the form of CSV dumps
> in S3 buckets. These dumps could be deltas or full dumps; we don't know in
> advance which they will be. We need to write this data into a Hive table,
> so while writing we might have to check each row for changes based on its
> primary key and only update Hive if the data for that primary key has
> changed. How can this be solved using Gobblin? I looked into Hive MERGE
> but was wondering how I could use it with Gobblin.


Hi Tilak,

What kind of scale are you looking at? And do you have managed Hive tables
or external ones?
If I recall correctly, updates can only be applied to managed Hive ORC
tables. However, I doubt that a lookup-and-update approach would work well
at high volume. If your volume is low and the Hive table is managed, then
you can look into an S3 source, a converter that does the lookup, and a
JDBC writer.
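
Very roughly, such a job would be wired up as below. Only the property keys
are standard Gobblin; the com.example classes are placeholders for
components you would have to write (including the Hive JDBC writer):

job.name=S3CsvToHive
job.group=ingestion

# File-based source reading the CSV dumps over s3a (credentials omitted)
source.class=com.example.S3CsvSource
source.filebased.fs.uri=s3a://client-bucket
source.filebased.data.directory=/dumps

# Converter that looks up each primary key and drops unchanged rows
converter.classes=com.example.PrimaryKeyLookupConverter

# A hypothetical Hive JDBC writer would slot in here
writer.builder.class=com.example.HiveJdbcWriterBuilder
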
For high volume, however, your use case looks similar to our database ingest
at LinkedIn. We ingest snapshots as well as increments, and apply the
increments to the snapshots. We materialize the deltas into a new snapshot
only infrequently; instead we use specialized readers that read the snapshot
with the deltas applied at read time. The delta materialization into a
snapshot is done via a legacy system, which is on its way to being replaced
with Gobblin.
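
Conceptually, the read-time merge is just "latest version per primary key
wins". A minimal illustration of that idea (not our actual reader code),
assuming each row carries a primary key and a version:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustration only: merge a snapshot with later deltas at read time,
 * keeping the newest record per primary key. Row is a stand-in for the
 * record type a real reader would produce.
 */
public final class ReadTimeMerger {

  public record Row(String primaryKey, long version, String payload) {}

  public static Map<String, Row> merge(List<Row> snapshot, List<Row> deltas) {
    Map<String, Row> merged = new LinkedHashMap<>();
    // Start from the snapshot rows.
    for (Row row : snapshot) {
      merged.put(row.primaryKey(), row);
    }
    // Overlay the deltas; the newer version wins for each primary key.
    for (Row delta : deltas) {
      merged.merge(delta.primaryKey(), delta,
          (existing, incoming) ->
              incoming.version() >= existing.version() ? incoming : existing);
    }
    return merged;
  }
}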

Abhishek

Re: S3 to Hive

Posted by Abhishek Tiwari <ab...@apache.org>.
With Gobblin, the common pattern is to write the data out as Avro files and
use Hive registration to create the tables / partitions.
A JDBC writer is possible, but one has not been written for Hive yet.
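
As job-config properties, that pattern looks roughly like the following;
the class names are recalled from the Gobblin docs and should be checked
against the version you run:

# Write Avro files, then register the published paths in Hive
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/data/client_csv

# Registration policy that creates/updates the Hive tables and partitions
hive.registration.policy=org.apache.gobblin.hive.policy.HiveRegistrationPolicyBase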

Abhishek

On Sat, Mar 24, 2018 at 12:04 PM, Tilak Patidar <ti...@gmail.com>
wrote:

> Hey,
>
> The scale is small, about 50-80 GB.
> I am going with the approach of reading the CSVs from the S3 bucket and
> then writing a custom converter that converts FileAwareInputStream into
> GenericRecord. In this converter I am doing a batched lookup and filtering
> out rows. However, I was unable to find any HiveJDBCPublisher. How is data
> usually ingested into Hive using Gobblin? JDBC or something better?

Re: S3 to Hive

Posted by Tilak Patidar <ti...@gmail.com>.
Hey,

The scale is small, about 50-80 GB.
I am going with the approach of reading the CSVs from the S3 bucket and
then writing a custom converter that converts FileAwareInputStream into
GenericRecord. In this converter I am doing a batched lookup and filtering
out rows. However, I was unable to find any HiveJDBCPublisher. How is data
usually ingested into Hive using Gobblin? JDBC or something better?
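
Roughly, the converter looks like the sketch below, simplified to take a
single CSV line instead of the full FileAwareInputStream (a real version
would first wrap the stream in a CSV reader), with the lookup stubbed out.
Package names assume a recent Gobblin build:

import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.converter.SingleRecordIterable;

/**
 * Sketch: turn one CSV line into an Avro GenericRecord and drop rows whose
 * primary key already holds the same values. The lookup is a stub.
 */
public class CsvUpsertFilterConverter
    extends Converter<String, Schema, String, GenericRecord> {

  @Override
  public Schema convertSchema(String csvHeader, WorkUnitState workUnit)
      throws SchemaConversionException {
    // Fixed schema for the sketch; a real converter would derive it from the header.
    return SchemaBuilder.record("ClientRow").fields()
        .requiredString("id")
        .requiredString("payload")
        .endRecord();
  }

  @Override
  public Iterable<GenericRecord> convertRecord(Schema schema, String csvLine,
      WorkUnitState workUnit) throws DataConversionException {
    // Naive split; a real converter would use a proper CSV parser.
    String[] cols = csvLine.split(",", -1);
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", cols[0]);
    record.put("payload", cols[1]);
    // Drop rows that have not changed since the last load (batched lookup stubbed).
    if (unchangedInHive(cols[0], cols[1])) {
      return Collections.emptyList();
    }
    return new SingleRecordIterable<>(record);
  }

  // Placeholder for the batched Hive lookup described above.
  private boolean unchangedInHive(String id, String payload) {
    return false;
  }
}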


Regards,
Tilak Patidar
Email: tilakpatidar@gmail.com / tilakpr@thoughtworks.com
Telephone: +91 8608690984
ThoughtWorks <http://www.thoughtworks.com/>
"We are a community of passionate individuals whose purpose is to
revolutionize software design, creation and delivery, while advocating for
positive social change"


Re: S3 to Hive

Posted by Abhishek Tiwari <ab...@apache.org>.
+ user@gobblin

