Posted to user@hadoop.apache.org by Daniel Käfer <d....@hs-furtwangen.de> on 2012/10/25 21:24:20 UTC

reference architecture

Hello all,

I'm looking for a reference architecture for hadoop. The only result I
found is Lambda architecture from Nathan Marz[0].

By architecture I mean answers to questions like:
- How should I store the data? CSV, Thrift, ProtoBuf
- How should I model the data? ER model, star schema, something new?
- Normalized or denormalized, or both (master data normalized, then
transformation to denormalized, like ETL)
- How should I combine database and HDFS files?
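
To make the "normalized, then transformation to denormalized" option
concrete, here is a minimal sketch in plain Python. The customer and order
records and the denormalize helper are hypothetical, invented only for
illustration; in a real cluster the same join would typically run as a
MapReduce, Pig or Hive job.

```python
# Hypothetical master data, kept normalized: one entry per customer.
customers = {
    1: {"name": "Alice", "city": "Berlin"},
    2: {"name": "Bob", "city": "Munich"},
}

# Hypothetical fact records that reference the master data by key.
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 1, "amount": 15.5},
]


def denormalize(orders, customers):
    """Join each order with its customer so every output row is self-contained."""
    rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        rows.append({
            **order,
            "customer_name": customer["name"],
            "customer_city": customer["city"],
        })
    return rows


flat_orders = denormalize(orders, customers)
```

The normalized form stays the single place where a customer is updated; the
flat rows are derived from it and can be regenerated at any time.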

Are there any other documented architectures for hadoop?

Regards
Daniel Käfer


[0] http://www.manning.com/marz/ (only a preprint so far, not yet completed)

Re: reference architecture

Posted by Russell Jurney <ru...@gmail.com>.

I define one of these in the book Agile Data, from O'Reilly. I express
opinions on all the matters you ask about. But you don't have to take
my word for it...

It's a reading rainbow!

Jordi!

Russell Jurney http://datasyndrome.com

On Oct 27, 2012, at 1:09 AM, "Daniel Käfer" <d....@hs-furtwangen.de> wrote:

> On Friday, 26.10.2012, 18:25 +0100, Steve Loughran wrote:
>> Depends on the amount of data and expected use. If it's transient food
>> for the next MR jobs: HDFS
>
> Thanks for your help
>

Re: reference architecture

Posted by Daniel Käfer <d....@hs-furtwangen.de>.

On Friday, 26.10.2012, 18:25 +0100, Steve Loughran wrote:
> Depends on the amount of data and expected use. If it's transient food
> for the next MR jobs: HDFS 

Thanks for your help

Re: reference architecture

Posted by Steve Loughran <st...@hortonworks.com>.

On 25 October 2012 23:17, Daniel Käfer <d....@hs-furtwangen.de> wrote:

> On Thursday, 25.10.2012, 22:10 +0100, Steve Loughran wrote:
> > Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig
> > and Hive can work with that as well as rawer data kept in HDFS
> > directly
>
> But is that the best idea? HBase is great for random reads and small
> range scans. But Hive (SQL) performance on HBase is 4-5x slower than on
> plain HDFS. [0]
>
>

> I guess keeping the first data (raw data) in HDFS and the final data in
> HBase is a good idea. But how should I store the data between individual
> MapReduce jobs?
>

Depends on the amount of data and expected use. If it's transient food for
the next MR jobs: HDFS
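
As a rough illustration of that hand-off, here is a minimal sketch in plain
Python. It does not use any Hadoop API: a local temporary directory stands
in for a scratch path on HDFS, and count_words, top_words and run_pipeline
are hypothetical names made up for this example.

```python
import os
import tempfile
from collections import Counter


def count_words(lines, out_path):
    """Job 1: count words and write 'word<TAB>count' lines to an intermediate file."""
    counts = Counter(word for line in lines for word in line.split())
    with open(out_path, "w", encoding="utf-8") as out:
        for word, n in sorted(counts.items()):
            out.write(f"{word}\t{n}\n")


def top_words(in_path, k):
    """Job 2: read the intermediate file and return the k most frequent words."""
    pairs = []
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            word, n = line.rstrip("\n").split("\t")
            pairs.append((word, int(n)))
    # Sort by descending count, then alphabetically so ties are deterministic.
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs[:k]


def run_pipeline(lines, k):
    # Assumption: the temporary directory plays the role of an HDFS scratch
    # path. It is deleted once both jobs have finished, because the
    # intermediate data is only transient input for the second job.
    with tempfile.TemporaryDirectory() as scratch:
        intermediate = os.path.join(scratch, "part-00000")
        count_words(lines, intermediate)
        return top_words(intermediate, k)


result = run_pipeline(["a b a", "b a c"], 2)
```

The point of the sketch is only the shape of the pipeline: the first job
writes plain files, the second job reads them, and nothing needs to survive
once the chain has completed.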

Re: reference architecture

Posted by Daniel Käfer <d....@hs-furtwangen.de>.

On Thursday, 25.10.2012, 22:10 +0100, Steve Loughran wrote:
> I quite like the new Hadoop in Practice for a lot of that, especially
> the answer to #2, "how to store the data", where he looks at all the
> options

Part 3, "Big Data Patterns", looks very interesting. I am going to read
the book.

On Thursday, 25.10.2012, 22:10 +0100, Steve Loughran wrote:
> Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig
> and Hive can work with that as well as rawer data kept in HDFS
> directly

But is that the best idea? HBase is great for random reads and small
range scans. But Hive (SQL) performance on HBase is 4-5x slower than on
plain HDFS. [0]

I guess keeping the first data (raw data) in HDFS and the final data in
HBase is a good idea. But how should I store the data between individual
MapReduce jobs?


[0] Todd Lipcon,
http://de.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction
p. 19. I have not benchmarked the performance myself.

Re: reference architecture

Posted by Steve Loughran <st...@hortonworks.com>.

On 25 October 2012 20:24, Daniel Käfer <d....@hs-furtwangen.de> wrote:

> Hello all,
>
> I'm looking for a reference architecture for hadoop. The only result I
> found is Lambda architecture from Nathan Marz[0].
>

I quite like the new Hadoop in Practice for a lot of that, especially the
answer to #2, "how to store the data", where he looks at all the options.
Joining is the other big issue.

http://steveloughran.blogspot.co.uk/2012/10/hadoop-in-practice-applied-hadoop.html

Regarding storing DB data, HBase-on-HDFS is where people keep it; Pig and
Hive can work with that as well as rawer data kept in HDFS directly


> By architecture I mean answers to questions like:
> - How should I store the data? CSV, Thrift, ProtoBuf
> - How should I model the data? ER model, star schema, something new?
> - Normalized or denormalized, or both (master data normalized, then
> transformation to denormalized, like ETL)
> - How should I combine database and HDFS files?
>
> Are there any other documented architectures for hadoop?
>
> Regards
> Daniel Käfer
>
>
> [0] http://www.manning.com/marz/ (only a preprint so far, not yet completed)
>
>
