Posted to user@hbase.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2010/01/14 05:06:23 UTC

MR on HDFS data inserted via HBase?

Hello,

If I import data into HBase, can I still run a hand-written MapReduce job over that data in HDFS?
That is, not using TableInputFormat to read the data back out via HBase.

Similarly, can one run Hive or Pig scripts against that data, but again, without Hive or Pig reading the data via HBase, but rather getting to it directly via HDFS?  I'm asking because I'm wondering whether storing data in HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


Re: MR on HDFS data inserted via HBase?

Posted by Amandeep Khurana <am...@gmail.com>.
> - data that is only available in memory of the regionserver
>

Precisely the reason why I said it's non-trivial

Re: MR on HDFS data inserted via HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Thanks.  I'm already turned off. :)  Thanks for the quick advice, Amandeep & Ryan! (saw that 1M inserts/sec, impressive)

Otis




----- Original Message ----
> From: Ryan Rawson <ry...@gmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Wed, January 13, 2010 11:35:12 PM
> Subject: Re: MR on HDFS data inserted via HBase?
> 
> Hey,
> 
> It isn't just as simple as 'read HBase's files'.  You will also need:
> - data that is only available in memory of the regionserver
> - merge multiple HFiles
> - do delete processing, etc., i.e. reproduce the RegionServer read path
> 
> Due to #1, I don't feel this is a particularly fruitful approach.
> 
> -ryan
> 
> On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic
> wrote:
> > Hello,
> >
> >
> > ----- Original Message ----
> >
> >> From: Amandeep Khurana 
> >
> >> HBase has its own file format. Reading data from it in your own job will not
> >> be trivial to write, but not impossible.
> >
> > You are referring to HTable, HFile, etc.?
> >
> >> Why would you want to use the underlying data files in the MR jobs? Any
> >> limitation in using the HBase api?
> >
> > Are you referring to writing an MR job that makes use of TableInputFormat and
> > TableOutputFormat as mentioned on
> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?
> >
> > I think that would work.
> >
> > But I'd also like to be able to run Hive/Pig scripts over the data, and I
> > *think* neither supports reading it from HBase.  But they can obviously read
> > files in HDFS; that's why I was asking.  It sounds like anything that wants
> > to read HBase's data from behind its back, without going through the HBase
> > API, would have to know how to read HFile & friends?
> > (and again, I think/assume Hive and Pig don't know how to do that)
> >
> > Thanks,
> > Otis
> >
> >> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> >> otis_gospodnetic@yahoo.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > If I import data into HBase, can I still run a hand-written MapReduce job
> >> > over that data in HDFS?
> >> > That is, not using TableInputFormat to read the data back out via HBase.
> >> >
> >> > Similarly, can one run Hive or Pig scripts against that data, but again,
> >> > without Hive or Pig reading the data via HBase, but rather getting to it
> >> > directly via HDFS?  I'm asking because I'm wondering whether storing data 
> in
> >> > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
> >> >
> >> > Thanks,
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >> >
> >> >
> >
> >


Re: MR on HDFS data inserted via HBase?

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

It isn't just as simple as 'read HBase's files'.  You will also need:
- data that is only available in memory of the regionserver
- merge multiple HFiles
- do delete processing, etc., i.e. reproduce the RegionServer read path

Due to #1, I don't feel this is a particularly fruitful approach.

-ryan
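
Conceptually, the merge-and-tombstone handling those bullets describe can be sketched like this (a simplified, stdlib-only Java illustration; the class and method names are hypothetical, and this is not HBase's actual read-path code):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: each TreeMap stands in for one sorted store (the
// in-memory memstore or one HFile on HDFS); a null value marks a delete
// tombstone.
public class ReadPathSketch {

    // Merge stores ordered newest-first: the first store that mentions a row
    // wins, and rows whose winning entry is a tombstone are dropped.
    static TreeMap<String, String> merge(List<TreeMap<String, String>> newestFirst) {
        TreeMap<String, String> result = new TreeMap<>();
        for (TreeMap<String, String> store : newestFirst) {
            for (Map.Entry<String, String> e : store.entrySet()) {
                if (!result.containsKey(e.getKey())) {   // newer stores shadow older ones
                    result.put(e.getKey(), e.getValue());
                }
            }
        }
        result.values().removeIf(v -> v == null);        // apply delete markers
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, String> memstore = new TreeMap<>();  // newest, memory-only
        memstore.put("row1", "v3");
        memstore.put("row2", null);                          // row2 was deleted
        TreeMap<String, String> hfile = new TreeMap<>();     // older flush on HDFS
        hfile.put("row2", "v2");
        hfile.put("row3", "v1");

        System.out.println(merge(List.of(memstore, hfile))); // {row1=v3, row3=v1}
        // A job reading only the HFile from HDFS would see the stale row2
        // and never see row1 at all -- point #1 above.
    }
}
```

A job (or Hive/Pig script) scanning HFiles straight off HDFS faces exactly this: without the memstore contents and the merge/delete logic, its results are stale or wrong.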

On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hello,
>
>
> ----- Original Message ----
>
>> From: Amandeep Khurana <am...@gmail.com>
>
>> HBase has its own file format. Reading data from it in your own job will not
>> be trivial to write, but not impossible.
>
> You are referring to HTable, HFile, etc.?
>
>> Why would you want to use the underlying data files in the MR jobs? Any
>> limitation in using the HBase api?
>
> Are you referring to writing an MR job that makes use of TableInputFormat and TableOutputFormat as mentioned on http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?
>
> I think that would work.
>
> But I'd also like to be able to run Hive/Pig scripts over the data, and I *think* neither supports reading it from HBase.  But they can obviously read files in HDFS; that's why I was asking.  It sounds like anything that wants to read HBase's data from behind its back, without going through the HBase API, would have to know how to read HFile & friends?
> (and again, I think/assume Hive and Pig don't know how to do that)
>
> Thanks,
> Otis
>
>> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
>> otis_gospodnetic@yahoo.com> wrote:
>>
>> > Hello,
>> >
>> > If I import data into HBase, can I still run a hand-written MapReduce job
>> > over that data in HDFS?
>> > That is, not using TableInputFormat to read the data back out via HBase.
>> >
>> > Similarly, can one run Hive or Pig scripts against that data, but again,
>> > without Hive or Pig reading the data via HBase, but rather getting to it
>> > directly via HDFS?  I'm asking because I'm wondering whether storing data in
>> > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >
>> >
>
>

Re: MR on HDFS data inserted via HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yeah, I'm JIRA Watch-ing them.  Thanks.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Andrew Purtell <ap...@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Thu, January 14, 2010 5:29:31 AM
> Subject: Re: MR on HDFS data inserted via HBase?
> 
> There is some ongoing work on an HBase SerDe for Hive:
> 
>     https://issues.apache.org/jira/browse/HIVE-705
> 
>     https://issues.apache.org/jira/browse/HIVE-806
> 
>   - Andy
> 
> 
> ----- Original Message ----
> > From: Amandeep Khurana 
> > To: hbase-user@hadoop.apache.org
> > Sent: Wed, January 13, 2010 8:36:15 PM
> > Subject: Re: MR on HDFS data inserted via HBase?
> > 
> > Yes, by api I mean TableInputFormat and TableOutputFormat.
> > 
> > Pig has a connector to HBase. Not sure if Hive has one yet.
> > 
> > 
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> > 
> > 
> > On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic <
> > otis_gospodnetic@yahoo.com> wrote:
> > 
> > > Hello,
> > >
> > >
> > > ----- Original Message ----
> > >
> > > > From: Amandeep Khurana 
> > >
> > > > HBase has its own file format. Reading data from it in your own job will
> > > not
> > > > be trivial to write, but not impossible.
> > >
> > > You are referring to HTable, HFile, etc.?
> > >
> > > > Why would you want to use the underlying data files in the MR jobs? Any
> > > > limitation in using the HBase api?
> > >
> > > Are you referring to writing an MR job that makes use of TableInputFormat
> > > and TableOutputFormat as mentioned on
> > > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?
> > >
> > > I think that would work.
> > >
> > > But I'd also like to be able to run Hive/Pig scripts over the data, and I
> > > *think* neither supports reading it from HBase.  But they can obviously read
> > > files in HDFS; that's why I was asking.  It sounds like anything that wants
> > > to read HBase's data from behind its back, without going through the HBase
> > > API, would have to know how to read HFile & friends?
> > > (and again, I think/assume Hive and Pig don't know how to do that)
> > >
> > > Thanks,
> > > Otis
> > >
> > > > On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> > > > otis_gospodnetic@yahoo.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > If I import data into HBase, can I still run a hand-written MapReduce
> > > job
> > > > > over that data in HDFS?
> > > > > That is, not using TableInputFormat to read the data back out via
> > > HBase.
> > > > >
> > > > > Similarly, can one run Hive or Pig scripts against that data, but
> > > again,
> > > > > without Hive or Pig reading the data via HBase, but rather getting to
> > > it
> > > > > directly via HDFS?  I'm asking because I'm wondering whether storing
> > > data in
> > > > > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > > >
> > > > >
> > >
> > >


Re: MR on HDFS data inserted via HBase?

Posted by Andrew Purtell <ap...@apache.org>.
There is some ongoing work on an HBase SerDe for Hive:

    https://issues.apache.org/jira/browse/HIVE-705

    https://issues.apache.org/jira/browse/HIVE-806

  - Andy


----- Original Message ----
> From: Amandeep Khurana <am...@gmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Wed, January 13, 2010 8:36:15 PM
> Subject: Re: MR on HDFS data inserted via HBase?
> 
> Yes, by api I mean TableInputFormat and TableOutputFormat.
> 
> Pig has a connector to HBase. Not sure if Hive has one yet.
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
> 
> > Hello,
> >
> >
> > ----- Original Message ----
> >
> > > From: Amandeep Khurana 
> >
> > > HBase has its own file format. Reading data from it in your own job will
> > not
> > > be trivial to write, but not impossible.
> >
> > You are referring to HTable, HFile, etc.?
> >
> > > Why would you want to use the underlying data files in the MR jobs? Any
> > > limitation in using the HBase api?
> >
> > Are you referring to writing an MR job that makes use of TableInputFormat
> > and TableOutputFormat as mentioned on
> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?
> >
> > I think that would work.
> >
> > But I'd also like to be able to run Hive/Pig scripts over the data, and I
> > *think* neither supports reading it from HBase.  But they can obviously read
> > files in HDFS; that's why I was asking.  It sounds like anything that wants
> > to read HBase's data from behind its back, without going through the HBase
> > API, would have to know how to read HFile & friends?
> > (and again, I think/assume Hive and Pig don't know how to do that)
> >
> > Thanks,
> > Otis
> >
> > > On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> > > otis_gospodnetic@yahoo.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > If I import data into HBase, can I still run a hand-written MapReduce
> > job
> > > > over that data in HDFS?
> > > > That is, not using TableInputFormat to read the data back out via
> > HBase.
> > > >
> > > > Similarly, can one run Hive or Pig scripts against that data, but
> > again,
> > > > without Hive or Pig reading the data via HBase, but rather getting to
> > it
> > > > directly via HDFS?  I'm asking because I'm wondering whether storing
> > data in
> > > > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > >
> > > >
> >
> >





Re: MR on HDFS data inserted via HBase?

Posted by Amandeep Khurana <am...@gmail.com>.
Yes, by api I mean TableInputFormat and TableOutputFormat.

Pig has a connector to HBase. Not sure if Hive has one yet.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
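
As a rough illustration of what TableInputFormat does with a table, here is a hypothetical, stdlib-only sketch of its split planning: roughly one input split (and hence one map task) per region, bounded by the regions' start keys. This is a simplified model for illustration, not HBase's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of TableInputFormat's split planning:
// one split per region, bounded by consecutive region start keys.
public class RegionSplitSketch {

    // A split is the half-open row range [startRow, endRow); "" = unbounded.
    static final class Split {
        final String startRow;
        final String endRow;
        Split(String startRow, String endRow) { this.startRow = startRow; this.endRow = endRow; }
        @Override public String toString() { return "[" + startRow + ", " + endRow + ")"; }
    }

    // Given the sorted start keys of a table's regions, emit one split each.
    static List<Split> splitsForRegions(List<String> regionStartKeys) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String end = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : "";
            splits.add(new Split(regionStartKeys.get(i), end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A table split into three regions at keys "g" and "p".
        List<Split> splits = splitsForRegions(List.of("", "g", "p"));
        System.out.println(splits.size() + " map tasks: " + splits); // 3 map tasks: [[, g), [g, p), [p, )]
    }
}
```

Scanning each range through the region server this way, instead of reading HFiles directly, gives every map task a consistent, merged view of its key range.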


On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
>
> ----- Original Message ----
>
> > From: Amandeep Khurana <am...@gmail.com>
>
> > HBase has its own file format. Reading data from it in your own job will
> not
> > be trivial to write, but not impossible.
>
> You are referring to HTable, HFile, etc.?
>
> > Why would you want to use the underlying data files in the MR jobs? Any
> > limitation in using the HBase api?
>
> Are you referring to writing an MR job that makes use of TableInputFormat
> and TableOutputFormat as mentioned on
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink?
>
> I think that would work.
>
> But I'd also like to be able to run Hive/Pig scripts over the data, and I
> *think* neither supports reading it from HBase.  But they can obviously read
> files in HDFS; that's why I was asking.  It sounds like anything that wants
> to read HBase's data from behind its back, without going through the HBase
> API, would have to know how to read HFile & friends?
> (and again, I think/assume Hive and Pig don't know how to do that)
>
> Thanks,
> Otis
>
> > On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> > otis_gospodnetic@yahoo.com> wrote:
> >
> > > Hello,
> > >
> > > If I import data into HBase, can I still run a hand-written MapReduce
> job
> > > over that data in HDFS?
> > > That is, not using TableInputFormat to read the data back out via
> HBase.
> > >
> > > Similarly, can one run Hive or Pig scripts against that data, but
> again,
> > > without Hive or Pig reading the data via HBase, but rather getting to
> it
> > > directly via HDFS?  I'm asking because I'm wondering whether storing
> data in
> > > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
> > >
> > > Thanks,
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > >
> > >
>
>

Re: MR on HDFS data inserted via HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

 
----- Original Message ----

> From: Amandeep Khurana <am...@gmail.com>

> HBase has its own file format. Reading data from it in your own job will not
> be trivial to write, but not impossible.

You are referring to HTable, HFile, etc.?

> Why would you want to use the underlying data files in the MR jobs? Any
> limitation in using the HBase api?

Are you referring to writing an MR job that makes use of TableInputFormat and TableOutputFormat as mentioned on http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink ?

I think that would work.

But I'd also like to be able to run Hive/Pig scripts over the data, and I *think* neither supports reading it from HBase.  But they can obviously read files in HDFS; that's why I was asking.  It sounds like anything that wants to read HBase's data from behind its back, without going through the HBase API, would have to know how to read HFile & friends?
(and again, I think/assume Hive and Pig don't know how to do that)

Thanks,
Otis

> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
> 
> > Hello,
> >
> > If I import data into HBase, can I still run a hand-written MapReduce job
> > over that data in HDFS?
> > That is, not using TableInputFormat to read the data back out via HBase.
> >
> > Similarly, can one run Hive or Pig scripts against that data, but again,
> > without Hive or Pig reading the data via HBase, but rather getting to it
> > directly via HDFS?  I'm asking because I'm wondering whether storing data in
> > HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
> >
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >


Re: MR on HDFS data inserted via HBase?

Posted by Amandeep Khurana <am...@gmail.com>.
HBase has its own file format. Reading data from it in your own job will not
be trivial to write, but not impossible.

Why would you want to use the underlying data files in the MR jobs? Any
limitation in using the HBase api?

On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> If I import data into HBase, can I still run a hand-written MapReduce job
> over that data in HDFS?
> That is, not using TableInputFormat to read the data back out via HBase.
>
> Similarly, can one run Hive or Pig scripts against that data, but again,
> without Hive or Pig reading the data via HBase, but rather getting to it
> directly via HDFS?  I'm asking because I'm wondering whether storing data in
> HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>