Posted to user@hive.apache.org by amit jaiswal <am...@yahoo.com> on 2010/08/02 08:26:42 UTC

How to mount/proxy a db table in hive

Hi,

I have a database and am looking for a way to 'mount' the db table in Hive in 
such a way that a select query in Hive gets translated into a SQL query against 
the database. I looked at DBInputFormat and Sqoop, but found nothing that can 
create a proxy table in Hive which internally makes db calls.

I also tried to use a custom variant of DBInputFormat as the input format for 
the database table.

create table employee (id int, name string) stored as INPUTFORMAT 
'mycustominputformat' OUTPUTFORMAT 
'org.apache.hadoop.mapred.SequenceFileOutputFormat';

select id from employee;
This fails while running the Hadoop job because HiveInputFormat only supports 
FileSplits.

HiveInputFormat:
    public long getStart() {
      if (inputSplit instanceof FileSplit) {
        return ((FileSplit)inputSplit).getStart();
      }
      // any non-FileSplit falls through and silently reports offset 0
      return 0;
    }

Any suggestions as to whether there is an existing InputFormat implementation 
that can be used?

-amit

Re: How to mount/proxy a db table in hive

Posted by John Sichi <js...@facebook.com>.
On Aug 2, 2010, at 7:28 AM, Edward Capriolo wrote:
> Maybe the new 'storage handlers' would help. Storage handlers tie
> together input formats, SerDes, and create/drop table functions. A
> JDBC backend storage handler would be a pretty neat thing.


Yes, a storage handler for wrapping a JDBC table would be handy.  There have been some discussions about using this for exposing the metastore via Hive views (e.g. INFORMATION_SCHEMA).

The storage handler support is good for wrapping everything together, but the issue with the split type would still need to be resolved.

JVS


Re: How to mount/proxy a db table in hive

Posted by John Sichi <js...@facebook.com>.
Either the handler would need to provide its own InputFormat and Split classes wrapping the ones from DBInputFormat (following the example of existing storage handlers such as HBase, where HBaseSplit extends FileSplit and wraps an underlying TableSplit), or we would need to finally clean up HiveInputFormat to stop assuming everything is file-based.

JVS

On Aug 4, 2010, at 3:34 AM, amit jaiswal wrote:

> Hi,
> 
> Any pointers on what needs to be done to implement a storage handler for this 
> functionality? What would need to be taken care of? Would it be a small 
> change, or something big?
> 
> -regards
> Amit
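
A rough sketch of the wrapper approach JVS describes, following the
HBaseSplit pattern. JdbcSplit is a hypothetical class name, and note
that DBInputFormat.DBInputSplit is protected in some Hadoop versions,
in which case the handler would need its own copy of the split class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

// Extends FileSplit so HiveInputFormat's instanceof checks pass,
// while delegating to the real split produced by DBInputFormat.
public class JdbcSplit extends FileSplit {

  private DBInputFormat.DBInputSplit dbSplit;

  public JdbcSplit() {
    // Only used before readFields() repopulates the split.
    super((Path) null, 0, 0, (String[]) null);
    dbSplit = new DBInputFormat.DBInputSplit();
  }

  public JdbcSplit(DBInputFormat.DBInputSplit dbSplit, Path dummyPath) {
    // The dummy path keeps Hive's path-to-partition lookup happy.
    super(dummyPath, 0, 0, (String[]) null);
    this.dbSplit = dbSplit;
  }

  public DBInputFormat.DBInputSplit getWrappedSplit() {
    return dbSplit;
  }

  @Override
  public long getLength() {
    try {
      return dbSplit.getLength();
    } catch (IOException e) {
      return 0;
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    dbSplit.readFields(in);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);
    dbSplit.write(out);
  }
}

The handler's InputFormat would wrap each split coming out of
DBInputFormat.getSplits() in a JdbcSplit, and unwrap it again via
getWrappedSplit() before asking DBInputFormat for a record reader.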


Re: How to mount/proxy a db table in hive

Posted by amit jaiswal <am...@yahoo.com>.
Hi,

Any pointers on what needs to be done to implement a storage handler for this 
functionality? What would need to be taken care of? Would it be a small 
change, or something big?

-regards
Amit




Re: How to mount/proxy a db table in hive

Posted by Edward Capriolo <ed...@gmail.com>.
Maybe the new 'storage handlers' would help. Storage handlers tie
together input formats, SerDes, and create/drop table functions. A
JDBC backend storage handler would be a pretty neat thing.
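
For illustration, a skeleton of such a handler might look roughly like
this. JdbcStorageHandler, JdbcInputFormat, JdbcSerDe, and the jdbc.*
property names are all made up here, and the method names follow the
HiveStorageHandler interface as it existed around this time, so check
them against your Hive version:

import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat;
import org.apache.hadoop.hive.ql.metadata.HiveStorageHandler;
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.OutputFormat;

public class JdbcStorageHandler implements HiveStorageHandler {

  private Configuration conf;

  public Class<? extends InputFormat> getInputFormatClass() {
    return JdbcInputFormat.class;  // wraps DBInputFormat, emits JdbcSplits
  }

  public Class<? extends OutputFormat> getOutputFormatClass() {
    // Read-only sketch; writing back over JDBC is a separate problem.
    return HiveSequenceFileOutputFormat.class;
  }

  public Class<? extends SerDe> getSerDeClass() {
    return JdbcSerDe.class;  // maps ResultSet columns to Hive columns
  }

  public HiveMetaHook getMetaHook() {
    return null;  // nothing special to do on create/drop table
  }

  public void configureTableJobProperties(
      TableDesc tableDesc, Map<String, String> jobProperties) {
    // Pass connection info from the table definition down to the job.
    Properties props = tableDesc.getProperties();
    jobProperties.put("jdbc.url", props.getProperty("jdbc.url"));
    jobProperties.put("jdbc.table", props.getProperty("jdbc.table"));
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

A table would then be declared with STORED BY instead of STORED AS
(class and property names again being placeholders):

create table employee (id int, name string)
stored by 'com.example.JdbcStorageHandler'
with serdeproperties (
  'jdbc.url' = 'jdbc:mysql://dbhost/mydb',
  'jdbc.table' = 'employee'
);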

Re: which version of thrift.jar in Hive 0.5

Posted by Carl Steinbach <ca...@cloudera.com>.
Hi Stephen,

libthrift.jar was last updated in HIVE-438 
(https://issues.apache.org/jira/browse/HIVE-438). The version of 
libthrift.jar that was checked into lib/ was built from Apache Thrift 
revision 760184. In other words, it's not an official release version but 
something that was built from trunk.

Thanks.

Carl

On Tue, Aug 3, 2010 at 11:45 AM, Stephen Watt <sw...@us.ibm.com> wrote:

> Hi Folks
>
> I've had a look in the various build scripts and in the manifest of the jar
> itself, but I am unable to identify which version of Thrift is stored in
> thrift.jar. I can't see any references to a Maven repository where it is
> pulled down either. Can someone tell me which version of Thrift this is and
> where the jar was built and downloaded from?
>
> Regards
> Steve Watt

which version of thrift.jar in Hive 0.5

Posted by Stephen Watt <sw...@us.ibm.com>.
Hi Folks

I've had a look in the various build scripts and in the manifest of the 
jar itself, but I am unable to identify which version of Thrift is stored 
in thrift.jar. I can't see any references to a Maven repository where it 
is pulled down either. Can someone tell me which version of Thrift this is 
and where the jar was built and downloaded from?

Regards
Steve Watt 






Re: How to mount/proxy a db table in hive

Posted by amit jaiswal <am...@yahoo.com>.
The original data is stored in the database, and there is no need to create a 
separate copy of it in HDFS for every job. Extending the notion of a database, 
the data could live in any storage. One way of abstracting this out would be to 
implement an InputFormat that knows how to read the data and provides the 
correct InputSplit and RecordReader implementations. The custom input format 
that I mentioned works fine in a pure Hadoop job.

Is it possible to leverage the input format support in Hive table creation to 
make such queries? Just 'select * from <table>' support would be sufficient, as 
the actual SQL query can be part of the InputFormat implementation.

-amit
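
For reference, the pure-Hadoop setup described above looks roughly like
this with the stock org.apache.hadoop.mapred.lib.db classes. EmployeeRecord,
the driver class, and the connection URL are placeholders:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// A row type that DBInputFormat populates from the JDBC ResultSet.
public class EmployeeRecord implements Writable, DBWritable {
  long id;
  String name;

  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong("id");
    name = rs.getString("name");
  }

  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setString(2, name);
  }

  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    name = Text.readString(in);
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    Text.writeString(out, name);
  }
}

and the job configuration:

JobConf job = new JobConf();

// Where to connect; driver class and URL are placeholders.
DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost/mydb", "user", "password");

// setInput also registers DBInputFormat as the job's input format;
// the orderBy column is used to partition the query into splits.
DBInputFormat.setInput(job, EmployeeRecord.class, "employee",
    null /* conditions */, "id" /* orderBy */, "id", "name");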





Re: How to mount/proxy a db table in hive

Posted by Sonal Goyal <so...@gmail.com>.
Hi Amit,

Hive needs data to be stored in its own namespace. Can you please explain
why you want to call the database through Hive?

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal

