Posted to user@hive.apache.org by Ruslan Al-Fakikh <me...@gmail.com> on 2012/06/13 23:36:09 UTC

Querying RDBMS table in a Hive query

Hello to everyone,

I need to join HDFS data with a small amount of data taken from an
RDBMS. A possible solution is to import the RDBMS data into a regular
Hive table using Sqoop, but that way I'll have to keep the imported
Hive table up-to-date, which means I will have to update it every time
before joining it in a query.
Is there a way to load RDBMS data on the fly? Maybe a UDF that would
take RDBMS connection properties and load the data?

Thanks in advance,
Ruslan Al-Fakikh
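
For illustration, here is a minimal sketch of what such a UDTF might
look like. It is a sketch under assumptions, not a tested
implementation: the class name, the fixed two-string-column (k, v)
output schema, and all connection details are made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    // Sketch of a UDTF that takes a JDBC URL and a query and emits each
    // result row. For brevity it assumes plain string arguments and a
    // fixed two-column (k, v) string output schema.
    public class JdbcTableUDTF extends GenericUDTF {

      @Override
      public StructObjectInspector initialize(ObjectInspector[] args)
          throws UDFArgumentException {
        List<String> names = new ArrayList<String>();
        List<ObjectInspector> ois = new ArrayList<ObjectInspector>();
        names.add("k");
        ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        names.add("v");
        ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
      }

      @Override
      public void process(Object[] args) throws HiveException {
        String jdbcUrl = args[0].toString(); // e.g. jdbc:mysql://dbhost/mydb?user=u&password=p
        String query = args[1].toString();   // e.g. SELECT k, v FROM lookup
        try {
          Connection conn = DriverManager.getConnection(jdbcUrl);
          try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(query);
            while (rs.next()) {
              // emit one output row per RDBMS row
              forward(new Object[] { rs.getString(1), rs.getString(2) });
            }
          } finally {
            conn.close();
          }
        } catch (Exception e) {
          throw new HiveException("RDBMS load failed", e);
        }
      }

      @Override
      public void close() throws HiveException {
        // the connection is closed in process(), nothing to do here
      }
    }

Registered with CREATE TEMPORARY FUNCTION, it could be called from a
query, with the caveat raised elsewhere in this thread: every task that
invokes it opens its own database connection.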

Re: Querying RDBMS table in a Hive query

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Bejoy,

Again, I do understand those two steps, and I do understand that I
have a lot of options for making them run in sequence, but from the
very beginning my point was to avoid having two steps. I want to have
a dataset in the Hive warehouse that I can query at any time with
just a Hive query, without any preliminary imports/queries. So
implementing a custom UDF/InputFormat looks best for now, except for
having too many RDBMS connections (one connection per mapper, as far
as I understand).

Thanks


Re: Querying RDBMS table in a Hive query

Posted by Bejoy KS <be...@yahoo.com>.
Hi Ruslan

The solution Esteban pointed out was:
1. Import the lookup data from the RDBMS to HDFS/Hive (you can fire any ad hoc query here). If the data is just a few MBs, one or two maps/connections are enough.

2. A lookup on this smaller data can then be done by joining it with the larger table.

Now, since the lookup table is small, enable map joins so that the lookup table sits in the distributed cache and its data is used by the map tasks for the join.

The two sequential steps mentioned above can be scheduled using a workflow manager such as Oozie.

In simple terms, you can place these steps in order in a shell script and just execute the script (a minimal sketch of this sequence follows below).

 
Regards
Bejoy KS

Sent from handheld, please excuse typos.
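
For illustration, a minimal sketch of Bejoy's two-step sequence,
written in Java to match the other examples in this thread; in practice
a plain shell script or an Oozie workflow is the natural fit, and every
table name, host, and credential below is a placeholder.

    import java.util.Arrays;

    // Hypothetical driver for the two steps: (1) sqoop-import the small
    // lookup table, (2) run the Hive join with a map-join hint so the
    // lookup table is shipped to the map tasks via the distributed cache.
    public class ImportThenJoin {

      static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
          throw new RuntimeException("step failed: " + Arrays.toString(cmd));
        }
      }

      public static void main(String[] args) throws Exception {
        // Step 1: re-import the lookup table; one mapper is enough for a few MBs.
        run("sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/mydb",
            "--username", "hiveuser", "--password", "changeme",
            "--table", "lookup",
            "--hive-import", "--hive-overwrite", "--hive-table", "lookup",
            "-m", "1");

        // Step 2: the join itself; the MAPJOIN hint asks Hive to broadcast
        // the small table to the map tasks.
        run("hive", "-e",
            "SELECT /*+ MAPJOIN(l) */ b.*, l.v "
                + "FROM big_table b JOIN lookup l ON (b.k = l.k)");
      }
    }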


Re: Querying RDBMS table in a Hive query

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi Esteban,

Your solution is what I am trying to avoid: having to keep the HDFS
data up-to-date. I know I can easily schedule a dependency between the
Sqoop import job and the Hive query job, and currently we have a
scheduling tool (Opswise) for such things. But what if I just want to
run an ad hoc query and forget to re-import the lookup data?
Maybe there is a way to attach the Sqoop import as a hook to a
particular Hive table, making it run before every query?
But I understand the problem of having too many connections. I would
like to fetch the data only once and distribute it to all the mappers
via the distributed cache or something like it. Isn't there a way to
do that?

Ruslan


Re: Querying RDBMS table in a Hive query

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Ruslan,

Jan's approach sounds like a good workaround, but only if you can use
the output in a map join; I don't think it will scale nicely if you
have a very large number of tasks, since that translates into as many
DB connections to MySQL. I think a more scalable and reliable way is
simply to schedule an Oozie workflow that transfers the data from
MySQL to HDFS using Sqoop and triggers the Hive query once the
transfer is done.

cheers!
esteban.

--
Cloudera, Inc.





Re: Querying RDBMS table in a Hive query

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Thanks Jan


Re: Querying RDBMS table in a Hive query

Posted by Jan Dolinár <do...@gmail.com>.
On 6/15/12, Ruslan Al-Fakikh <ru...@jalent.ru> wrote:
> I didn't know InputFormat and LineReader could help, though I haven't
> looked at them closely. I was thinking about implementing a
> Table-Generating Function (UDTF) if there is no ready-made
> solution.

Both are possible, an InputFormat and/or a UD(T)F. It all depends on
what you need. I actually use both: in the InputFormat I load lists of
allowed values to check the data, and in a UDF I query another
database for values needed only in some queries. Generally, I'd use an
InputFormat in situations where all jobs over a given table would
require the additional data from the RDBMS. Conversely, in situations
where only a few jobs out of many require the RDBMS connection, I
would use a UDF.

I think the difference in performance between the two is rather small,
if any. Also, a UDF is easier to write, so it might be the "weapon of
choice", at least if you don't already use a custom InputFormat.

Jan
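
For illustration, a minimal sketch of the UDF variant Jan describes,
using Hive's simple reflection-based UDF API. The table, column, and
connection details are placeholders; lazy initialization plus a
per-task cache keeps it to one connection per task JVM and one
database hit per distinct key.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.hive.ql.exec.UDF;

    // Sketch of a point-lookup UDF: db_lookup(key) returns the matching
    // value from an external MySQL table, caching results per task.
    public class DbLookupUDF extends UDF {

      private static final String URL =
          "jdbc:mysql://dbhost/mydb?user=u&password=p"; // placeholder
      private Connection conn;
      private PreparedStatement stmt;
      private final Map<String, String> cache = new HashMap<String, String>();

      public String evaluate(String key) {
        if (key == null) return null;
        if (cache.containsKey(key)) return cache.get(key);
        try {
          if (conn == null) { // opened lazily, once per task JVM
            conn = DriverManager.getConnection(URL);
            stmt = conn.prepareStatement("SELECT v FROM lookup WHERE k = ?");
          }
          stmt.setString(1, key);
          ResultSet rs = stmt.executeQuery();
          String value = rs.next() ? rs.getString(1) : null;
          rs.close();
          cache.put(key, value); // remember misses too
          return value;
        } catch (SQLException e) {
          throw new RuntimeException("RDBMS lookup failed", e);
        }
      }
    }

Registered with CREATE TEMPORARY FUNCTION db_lookup AS 'DbLookupUDF',
it can then be called only from the queries that actually need the
external value.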

Re: Querying RDBMS table in a Hive query

Posted by Ruslan Al-Fakikh <ru...@jalent.ru>.
Thanks Jan,

I didn't know InputFormat and LineReader could help, though I haven't
looked at them closely. I was thinking about implementing a
Table-Generating Function (UDTF) if there is no ready-made
solution.

Ruslan



-- 
Best Regards,
Ruslan Al-Fakikh

Re: Querying RDBMS table in a Hive query

Posted by Jan Dolinár <do...@gmail.com>.
Hi Ruslan,

I've been in a similar situation and solved it by writing a custom
InputFormat and LineReader that load the data from MySQL in the
constructor. In my case I use it just to check value ranges and
similar stuff. If you want to join the data with what's in your HDFS
files, you can do that as well; the InputFormat lets you add the
columns easily. I'm not sure how well this solution would behave for
bigger data, but for small data (I load about 5 tables, ~100 lines
each) it works just fine.

Best Regards,
Jan
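
For illustration, a minimal sketch of this approach against the old
mapred API: the reader loads the small MySQL table once in its
constructor and appends a looked-up column to every line it returns.
The lookup table, the assumption that the first tab-separated field is
the key, and the connection details are all made up.

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Sketch of an InputFormat whose reader enriches each input line
    // with a value fetched once from MySQL, in the spirit Jan describes.
    public class EnrichingTextInputFormat extends TextInputFormat {

      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new EnrichingReader(new LineRecordReader(job, (FileSplit) split));
      }

      static class EnrichingReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader inner;
        private final Map<String, String> lookup = new HashMap<String, String>();

        EnrichingReader(LineRecordReader inner) throws IOException {
          this.inner = inner;
          // Load the whole (small) lookup table once per task.
          try {
            Connection c = DriverManager.getConnection(
                "jdbc:mysql://dbhost/mydb?user=u&password=p"); // placeholder
            try {
              ResultSet rs = c.createStatement()
                  .executeQuery("SELECT k, v FROM lookup");
              while (rs.next()) lookup.put(rs.getString(1), rs.getString(2));
            } finally { c.close(); }
          } catch (Exception e) { throw new IOException(e); }
        }

        public boolean next(LongWritable key, Text value) throws IOException {
          if (!inner.next(key, value)) return false;
          // Assume the first tab-separated field is the lookup key;
          // append the matching RDBMS value as an extra column.
          String line = value.toString();
          String k = line.split("\t", 2)[0];
          String extra = lookup.get(k);
          value.set(line + "\t" + (extra == null ? "\\N" : extra));
          return true;
        }

        public LongWritable createKey() { return inner.createKey(); }
        public Text createValue() { return inner.createValue(); }
        public long getPos() throws IOException { return inner.getPos(); }
        public float getProgress() throws IOException { return inner.getProgress(); }
        public void close() throws IOException { inner.close(); }
      }
    }

A Hive table could then name this class in its DDL via the STORED AS
INPUTFORMAT clause (together with a matching OUTPUTFORMAT).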


