Posted to user@hbase.apache.org by Naama Kraus <na...@gmail.com> on 2008/09/03 14:04:45 UTC

HBase, Hive, Pig and other Hadoop based technologies

Hi,

There are various technologies on top of Hadoop such as HBase, Hive, Pig and
more. I was wondering what are the differences between them. What are the
usage scenarios that fit each one of them.

For instance, is it true to say that Pig and Hive belong to the same family
? Or is Hive more close to HBase ?
My understanding is that HBase allows direct lookup and low latency queries,
while Pig and Hive provide batch processing operations which are M/R based.
Both define a data model and an SQL-like query language. Is this true ?

Could anyone shed light on when to use each technology ? Main differences ?
Pros and Cons ?
Information on other technologies such as Jaql is also welcome.

Thanks, Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by Naama Kraus <na...@gmail.com>.

Great, thanks, Naama

On Mon, Sep 8, 2008 at 7:09 PM, Jim Kellerman <ji...@powerset.com> wrote:

> Comments inline below:
> > -----Original Message-----
> > From: Naama Kraus [mailto:naamakraus@gmail.com]
> > Sent: Monday, September 08, 2008 1:00 AM
> > To: hadoop-core; hbase-user@hadoop.apache.org
> > Subject: Re: HBase, Hive, Pig and other Hadoop based technologies
> >
> > Both Pig and Hive are written on top of Hadoop, is that correct ?
>
> Yes.
>
> > Does this mean for instance that they would work with any implementation
> > of the Hadoop File System interface ? Whether it is HDFS, KFS, LocalFS,
> > or any future implementation that may be implemented ?
>
> Yes.
>
> > Is that true for HBase as well ?
>
> Yes.
>
> > I am assuming Pig and Hive use the Hadoop standard M/R framework and API,
> > is that right ?
>
> Yes.
>
> > Thanks, Naama
> >
> > On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus <na...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > There are various technologies on top of Hadoop such as HBase, Hive, Pig
> > > and more. I was wondering what are the differences between them. What are
> > > the usage scenarios that fit each one of them.
> > >
> > > For instance, is it true to say that Pig and Hive belong to the same family
> > > ? Or is Hive more close to HBase ?
> > > My understanding is that HBase allows direct lookup and low latency
> > > queries, while Pig and Hive provide batch processing operations which are
> > > M/R based. Both define a data model and an SQL-like query language. Is this
> > > true ?
> > >
> > > Could anyone shed light on when to use each technology ? Main differences ?
> > > Pros and Cons ?
> > > Information on other technologies such as Jaql is also welcome.
> > >
> > > Thanks, Naama
> > >
> > > --
> > > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > > 00 oo 00 oo
> > > "If you want your children to be intelligent, read them fairy tales. If you
> > > want them to be more intelligent, read them more fairy tales." (Albert
> > > Einstein)
> > >
> >
> >
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > 00 oo 00 oo
> > "If you want your children to be intelligent, read them fairy tales. If you
> > want them to be more intelligent, read them more fairy tales." (Albert
> > Einstein)
>



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

RE: HBase, Hive, Pig and other Hadoop based technologies

Posted by Jim Kellerman <ji...@powerset.com>.

Comments inline below:
> -----Original Message-----
> From: Naama Kraus [mailto:naamakraus@gmail.com]
> Sent: Monday, September 08, 2008 1:00 AM
> To: hadoop-core; hbase-user@hadoop.apache.org
> Subject: Re: HBase, Hive, Pig and other Hadoop based technologies
>
> Both Pig and Hive are written on top of Hadoop, is that correct ?

Yes.

> Does this mean for instance that they would work with any implementation
> of the Hadoop File System interface ? Whether it is HDFS, KFS, LocalFS,
> or any future implementation that may be implemented ?

Yes.

> Is that true for HBase as well ?

Yes.

> I am assuming Pig and Hive use the Hadoop standard M/R framework and API, is
> that right ?

Yes.

> Thanks, Naama
>
> On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus <na...@gmail.com> wrote:
>
> > Hi,
> >
> > There are various technologies on top of Hadoop such as HBase, Hive, Pig
> > and more. I was wondering what are the differences between them. What are
> > the usage scenarios that fit each one of them.
> >
> > For instance, is it true to say that Pig and Hive belong to the same family
> > ? Or is Hive more close to HBase ?
> > My understanding is that HBase allows direct lookup and low latency
> > queries, while Pig and Hive provide batch processing operations which are
> > M/R based. Both define a data model and an SQL-like query language. Is this
> > true ?
> >
> > Could anyone shed light on when to use each technology ? Main differences ?
> > Pros and Cons ?
> > Information on other technologies such as Jaql is also welcome.
> >
> > Thanks, Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > 00 oo 00 oo
> > "If you want your children to be intelligent, read them fairy tales. If you
> > want them to be more intelligent, read them more fairy tales." (Albert
> > Einstein)
> >
>
>
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by Naama Kraus <na...@gmail.com>.

Both Pig and Hive are written on top of Hadoop, is that correct ? Does this
mean for instance that they would work with any implementation of the Hadoop
File System interface ? Whether it is HDFS, KFS, LocalFS, or any future
implementation that may be implemented ?
Is that true for HBase as well ?

I am assuming Pig and Hive use the Hadoop standard M/R framework and API, is
that right ?

Thanks, Naama

On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus <na...@gmail.com> wrote:

> Hi,
>
> There are various technologies on top of Hadoop such as HBase, Hive, Pig
> and more. I was wondering what are the differences between them. What are
> the usage scenarios that fit each one of them.
>
> For instance, is it true to say that Pig and Hive belong to the same family
> ? Or is Hive more close to HBase ?
> My understanding is that HBase allows direct lookup and low latency
> queries, while Pig and Hive provide batch processing operations which are
> M/R based. Both define a data model and an SQL-like query language. Is this
> true ?
>
> Could anyone shed light on when to use each technology ? Main differences ?
> Pros and Cons ?
> Information on other technologies such as Jaql is also welcome.
>
> Thanks, Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
>



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by Naama Kraus <na...@gmail.com>.

Thanks for the reference. Naama

On Thu, Sep 4, 2008 at 9:01 PM, olston <ol...@gmail.com> wrote:

>
> Hi,
>
> Our SIGMOD paper about Pig
> (http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf)
> compares Pig with
> SQL, and to a first approximation Hive is like SQL.
>
> Briefly: Pig lets users specify the order of operations directly ("plugging
> together pipes" as bloggers who are fans of map-reduce are fond of saying),
> as in map-reduce (but in Pig the pipes can be composed arbitrarily, and
> inward/outward branching is explicitly modeled). In contrast, SQL decouples
> the program specification from the order of operations, which gets
> determined by an automatic "query optimizer." For a more detailed discussion
> please see our paper referenced above.
>
> -Chris
>
>
> Naama Kraus wrote:
> >
> > Hi,
> >
> > There are various technologies on top of Hadoop such as HBase, Hive, Pig
> > and more. I was wondering what are the differences between them. What are
> > the usage scenarios that fit each one of them.
> >
> > For instance, is it true to say that Pig and Hive belong to the same family
> > ? Or is Hive more close to HBase ?
> > My understanding is that HBase allows direct lookup and low latency
> > queries, while Pig and Hive provide batch processing operations which are
> > M/R based. Both define a data model and an SQL-like query language. Is this
> > true ?
> >
> > Could anyone shed light on when to use each technology ? Main differences ?
> > Pros and Cons ?
> > Information on other technologies such as Jaql is also welcome.
> >
> > Thanks, Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > 00 oo 00 oo
> > "If you want your children to be intelligent, read them fairy tales. If you
> > want them to be more intelligent, read them more fairy tales." (Albert
> > Einstein)
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/HBase%2C-Hive%2C-Pig-and-other-Hadoop-based-technologies-tp19287896p19316495.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by olston <ol...@gmail.com>.

Hi,

Our SIGMOD paper about Pig
(http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf) compares Pig with
SQL, and to a first approximation Hive is like SQL.

Briefly: Pig lets users specify the order of operations directly ("plugging
together pipes" as bloggers who are fans of map-reduce are fond of saying),
as in map-reduce (but in Pig the pipes can be composed arbitrarily, and
inward/outward branching is explicitly modeled). In contrast, SQL decouples
the program specification from the order of operations, which gets
determined by an automatic "query optimizer." For a more detailed discussion
please see our paper referenced above.

-Chris

Naama Kraus wrote:
> 
> Hi,
> 
> There are various technologies on top of Hadoop such as HBase, Hive, Pig
> and more. I was wondering what are the differences between them. What are
> the usage scenarios that fit each one of them.
>
> For instance, is it true to say that Pig and Hive belong to the same family
> ? Or is Hive more close to HBase ?
> My understanding is that HBase allows direct lookup and low latency
> queries, while Pig and Hive provide batch processing operations which are
> M/R based. Both define a data model and an SQL-like query language. Is this
> true ?
>
> Could anyone shed light on when to use each technology ? Main differences ?
> Pros and Cons ?
> Information on other technologies such as Jaql is also welcome.
>
> Thanks, Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
> 
> 

-- 
View this message in context: http://www.nabble.com/HBase%2C-Hive%2C-Pig-and-other-Hadoop-based-technologies-tp19287896p19316495.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
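
A minimal sketch of what "specifying the order of operations directly" can look like, written in plain Java rather than Pig Latin. Everything here is an assumption made for illustration: the class ExplicitDataflow, its two methods, and the sample records are hypothetical and do not come from Pig or Hive. Each step is a named stage whose output feeds the next, and the author, not an optimizer, decides that filtering happens before grouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: plain Java collections stand in for a data set.
// None of these names come from Pig or Hive.
final class ExplicitDataflow {

    /** Stage 2: keep only the records whose length is at least minLength. */
    static List<String> filterByLength(List<String> records, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String r : records) {
            if (r.length() >= minLength) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Stage 3: group records by first character and count each group. */
    static Map<Character, Integer> countByFirstChar(List<String> records) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String r : records) {
            if (r.isEmpty()) {
                continue; // an empty record has no first character to group on
            }
            counts.merge(r.charAt(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stage 1: "load" the input (hypothetical sample records).
        List<String> raw = Arrays.asList("hbase", "hive", "pig", "hadoop", "jaql");
        // The program text fixes the order: filter first, then group and count.
        List<String> filtered = filterByLength(raw, 4);
        Map<Character, Integer> counts = countByFirstChar(filtered);
        System.out.println(counts); // prints {h=3, j=1}
    }
}
```

In a declarative, SQL-style formulation the same result would be stated as a single query, and the system would be free to choose the order in which the filter and the grouping run.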

RE: Re: Can not retrieve input file name using mapred.input.file

Posted by Yair Even-Zohar <ya...@revenuescience.com>.

The problem with this solution is that I get the literal value of
args[0]. For example, when I enter a path like \tmp\files\* I get
inputFile=\tmp\files\* instead of the actual file's name.
The reason I need the exact file name is that the file name contains
some information I need to use.

Thanks
-Yair

-----Original Message-----
From: news [mailto:news@ger.gmane.org] On Behalf Of Billy Pearson
Sent: Friday, September 05, 2008 3:24 AM
To: hbase-user@hadoop.apache.org
Subject: Re: Can not retrieve input file name using mapred.input.file

Try using

job.set(filename, args[0]);

then use

job.get(filename);

Billy




"Yair Even-Zohar" <ya...@revenuescience.com> 
wrote in message 
news:4B94F7D3090A974E94A9BD23E57BB143019CAA4D@corpdc-exch01.corp.digimine.com...
Hi
I'm running an uploader to Hbase using mapreduce and in the map
configure method I use:

public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    inputFile = job.get("mapred.input.file");
    custId = "";
    if (inputFile != null) {
        String[] splits = inputFile.split("_");
        if (splits.length > 2) {
            custId = splits[1];
        }
    }
}


I get the mapTaskId correctly but am getting null inputFile.
I previously set

public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), ClogUploader.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    return c;  // the method is declared to return the JobConf
}


Any suggestions?

Thanks
-Yair

Re: Can not retrieve input file name using mapred.input.file

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

Try using

job.set(filename, args[0]);

then use

job.get(filename);

Billy




"Yair Even-Zohar" <ya...@revenuescience.com> 
wrote in message 
news:4B94F7D3090A974E94A9BD23E57BB143019CAA4D@corpdc-exch01.corp.digimine.com...
Hi
I'm running an uploader to Hbase using mapreduce and in the map
configure method I use:

public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    inputFile = job.get("mapred.input.file");
    custId = "";
    if (inputFile != null) {
        String[] splits = inputFile.split("_");
        if (splits.length > 2) {
            custId = splits[1];
        }
    }
}


I get the mapTaskId correctly but am getting null inputFile.
I previously set

public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), ClogUploader.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    return c;  // the method is declared to return the JobConf
}


Any suggestions?

Thanks
-Yair

Can not retrieve input file name using mapred.input.file

Posted by Yair Even-Zohar <ya...@revenuescience.com>.

Hi
I'm running an uploader to Hbase using mapreduce and in the map
configure method I use:

public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    inputFile = job.get("mapred.input.file");
    custId = "";
    if (inputFile != null) {
        String[] splits = inputFile.split("_");
        if (splits.length > 2) {
            custId = splits[1];
        }
    }
}


I get the mapTaskId correctly but am getting null inputFile.
I previously set 

public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), ClogUploader.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    return c;  // the method is declared to return the JobConf
}


Any suggestions?

Thanks
-Yair
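
One possible cause, offered as an assumption rather than a confirmed diagnosis: in the Hadoop releases of this period the framework publishes the current split's file under the key map.input.file for file-based input formats, and mapred.input.file is not a key it sets, which would explain the null. This is worth checking against the Hadoop version in use. The sketch below is plain Java with no Hadoop dependency, so that the parsing step can be tried on its own; InputFileNames, baseName, custIdFromPath, and the sample file name are all hypothetical. Unlike the snippet above, it takes the base name before splitting, so underscores in directory names do not shift the fields.

```java
// Sketch only: the class, method, and file names are hypothetical.
final class InputFileNames {

    /** Returns the last path component, accepting '/' or '\' as separator. */
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    /**
     * Mirrors the split("_") logic from configure(): the customer id is the
     * second underscore-separated field, and at least three fields are
     * required. Returns "" when the path is null or has too few fields.
     */
    static String custIdFromPath(String path) {
        if (path == null) {
            return "";
        }
        String[] parts = baseName(path).split("_");
        return parts.length > 2 ? parts[1] : "";
    }

    public static void main(String[] args) {
        // Assumption: in a real job this string would come from
        // job.get("map.input.file") inside configure(JobConf).
        System.out.println(custIdFromPath("/tmp/files/clog_acme_20080903.log"));
        // prints acme
    }
}
```

Inside configure(), the call would then look like custId = InputFileNames.custIdFromPath(job.get("map.input.file")), again assuming that key is the one populated by the framework.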

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by Naama Kraus <na...@gmail.com>.

Thanks Jeff for the informative answer.

Naama

On Wed, Sep 3, 2008 at 5:41 PM, Jeff Hammerbacher <
jeff.hammerbacher@gmail.com> wrote:

> Hey Naama,
>
> There's quite a bit going on here, but I'll try to get the ball
> rolling on an explanation of similarities and differences:
>
> 1) Language for data retrieval
> Both Pig and Hive implement languages for data retrieval. Pig is aimed
> at "experienced programmers for performing ad-hoc analysis of
> extremely large data sets", and often these data sets are "temporary".
> These design points dictate that Pig be procedural, though they have
> chosen a somewhat SQL-like syntax. Hive, on the other hand, is aimed
> more at data analysts rather than engineers, and thus uses a
> declarative language with a syntax that hews a bit closer to SQL. HBase
> offers a simpler API for getting and putting individual rows of data,
> though I believe someone has written a (possibly unsupported)
> SQL-like retrieval language (HQL?) above HBase.
>
> 2) Schema management
> Pig requires you to specify the structure of your data with each
> query, while Hive and HBase provide separate processes which manage
> the schemas of your data.
>
> 3) Managed storage
> Pig is agnostic to how you lay your data out within HDFS. Hive can
> also work with unmanaged data in HDFS, but if you let Hive manage your
> data, it can do a little bit of optimization for retrieval by
> partitioning your data inside of the file system. HBase manages data
> layout for you in the file system.
>
> Additionally, I'd say looking at the design points of each system
> might be of help:
>
> -Pig was designed for experienced programmers performing ad-hoc data
> analysis
> -Hive was designed for business analysts and programmers, for use in a
> data warehousing environment
> -HBase was designed to enable point lookup in addition to MapReduce,
> and can possibly be used in OLTP-type applications where availability
> is not a concern (though the HBase folks have told me they intend it
> to be used primarily for OLAP-style workloads)
>
> Regards,
> Jeff
>
> On Wed, Sep 3, 2008 at 5:04 AM, Naama Kraus <na...@gmail.com> wrote:
> > Hi,
> >
> > There are various technologies on top of Hadoop such as HBase, Hive, Pig
> > and more. I was wondering what are the differences between them. What are
> > the usage scenarios that fit each one of them.
> >
> > For instance, is it true to say that Pig and Hive belong to the same family
> > ? Or is Hive more close to HBase ?
> > My understanding is that HBase allows direct lookup and low latency
> > queries, while Pig and Hive provide batch processing operations which are
> > M/R based. Both define a data model and an SQL-like query language. Is this
> > true ?
> >
> > Could anyone shed light on when to use each technology ? Main differences ?
> > Pros and Cons ?
> > Information on other technologies such as Jaql is also welcome.
> >
> > Thanks, Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > 00 oo 00 oo
> > "If you want your children to be intelligent, read them fairy tales. If you
> > want them to be more intelligent, read them more fairy tales." (Albert
> > Einstein)
> >
>



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)

Re: HBase, Hive, Pig and other Hadoop based technologies

Posted by Jeff Hammerbacher <je...@gmail.com>.

Hey Naama,

There's quite a bit going on here, but I'll try to get the ball
rolling on an explanation of similarities and differences:

1) Language for data retrieval
Both Pig and Hive implement languages for data retrieval. Pig is aimed
at "experienced programmers for performing ad-hoc analysis of
extremely large data sets", and often these data sets are "temporary".
These design points dictate that Pig be procedural, though they have
chosen a somewhat SQL-like syntax. Hive, on the other hand, is aimed
more at data analysts rather than engineers, and thus uses a
declarative language with a syntax that hews a bit closer to SQL. HBase
offers a simpler API for getting and putting individual rows of data,
though I believe someone has written a (possibly unsupported)
SQL-like retrieval language (HQL?) above HBase.

2) Schema management
Pig requires you to specify the structure of your data with each
query, while Hive and HBase provide separate processes which manage
the schemas of your data.

3) Managed storage
Pig is agnostic to how you lay your data out within HDFS. Hive can
also work with unmanaged data in HDFS, but if you let Hive manage your
data, it can do a little bit of optimization for retrieval by
partitioning your data inside of the file system. HBase manages data
layout for you in the file system.

Additionally, I'd say looking at the design points of each system
might be of help:

-Pig was designed for experienced programmers performing ad-hoc data analysis
-Hive was designed for business analysts and programmers, for use in a
data warehousing environment
-HBase was designed to enable point lookup in addition to MapReduce,
and can possibly be used in OLTP-type applications where availability
is not a concern (though the HBase folks have told me they intend it
to be used primarily for OLAP-style workloads)

Regards,
Jeff

On Wed, Sep 3, 2008 at 5:04 AM, Naama Kraus <na...@gmail.com> wrote:
> Hi,
>
> There are various technologies on top of Hadoop such as HBase, Hive, Pig and
> more. I was wondering what are the differences between them. What are the
> usage scenarios that fit each one of them.
>
> For instance, is it true to say that Pig and Hive belong to the same family
> ? Or is Hive more close to HBase ?
> My understanding is that HBase allows direct lookup and low latency queries,
> while Pig and Hive provide batch processing operations which are M/R based.
> Both define a data model and an SQL-like query language. Is this true ?
>
> Could anyone shed light on when to use each technology ? Main differences ?
> Pros and Cons ?
> Information on other technologies such as Jaql is also welcome.
>
> Thanks, Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
>
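
To make the contrast between point lookup and batch-style processing concrete, here is a toy model in plain Java. It is an assumption-laden sketch, not the HBase client API: RowStoreSketch and its methods are hypothetical, and a sorted in-memory map stands in for a table keyed by row. The get and put methods touch a single row by key, while the aggregate has to visit every row, which is the kind of work a MapReduce job over the whole data set would do.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a sorted in-memory map stands in for a table keyed by row.
// This is not the HBase client API; all names here are hypothetical.
final class RowStoreSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    /** Store or overwrite the value for a single row key. */
    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** Point lookup: fetch one row by key; returns null if the key is absent. */
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    /** Batch-style pass: visit every row to compute an aggregate. */
    int countValuesContaining(String needle) {
        int n = 0;
        for (Map.Entry<String, String> e : rows.entrySet()) {
            if (e.getValue().contains(needle)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        RowStoreSketch t = new RowStoreSketch();
        t.put("row-001", "pig script");
        t.put("row-002", "hive query");
        t.put("row-003", "hive table");
        System.out.println(t.get("row-002"));                // prints hive query
        System.out.println(t.countValuesContaining("hive")); // prints 2
    }
}
```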
