Posted to hdfs-user@hadoop.apache.org by Eli Finkelshteyn <ie...@gmail.com> on 2012/08/15 03:07:50 UTC

New Production Cluster Criticisms/Advice

Hey Folks,
I'm going to be setting up my first new production cluster soon, and was
hoping to get some advice and criticism on my current plan of action.
Here's my current plan:

*Background/Requirements:*
I'm setting this up for a start-up that's not gathering very big data yet,
but will be in the next few months (I hope, anyway). I'd like to use the
cluster for a few things, at least at first:
1. logging stuff it doesn't make sense to write to a normal database (as
well as duplicates of what I am throwing in my database so I can use that
stuff from HDFS later on). Basically, just logging a ton of information I
might want for analytics/model training later.
2. analytics processing.
3. model training (for machine learning). I'll primarily do this through
Mahout.
4. will probably want HBase on there as well for real-time reading of some
data. I'm not married to this, and haven't played around much with HBase
yet, but wanted to leave the possibility open.

*The Plan:*
I'm thinking I'll set this up in Amazon. We have most of the rest of our
hardware there, and I really like the option to be able to spin up a bunch
of extra workers at will to have them train some ML model for me and then
kill them off. For now, just to get things off the ground, I'm going to
set up a small 4-machine cluster (1 NameNode, 1
SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
around with that setup, and will add more to it as needed. Since everything
will be puppetized, adding more machines shouldn't be too bad (I think).
I've been using Cloudera so far, and I haven't seen any good reason to
switch, so I'll use CDH4. For logging, I'll just use Scribe and wind up
storing stuff as LZO files (a good tutorial on the best way to do this would be
awesome).

Thoughts?

Eli
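
One thing worth checking against the 2-DataNode layout in the plan above: HDFS keeps at most one replica of a block per DataNode, and the default dfs.replication is 3, so two DataNodes can hold only two copies of each block. The rough sketch below works through that arithmetic. The function names and the 500 GB per-node disk size are made up for illustration and do not come from the thread.

```python
# Hypothetical sizing sketch; the per-node disk size is an assumption.
DEFAULT_REPLICATION = 3  # HDFS default value of dfs.replication


def effective_replicas(datanodes: int, replication: int = DEFAULT_REPLICATION) -> int:
    """Copies of each block actually stored: at most one per DataNode."""
    return min(datanodes, replication)


def usable_capacity_gb(datanodes: int, disk_gb_per_node: float,
                       replication: int = DEFAULT_REPLICATION) -> float:
    """Raw disk capacity divided by the number of copies kept per block."""
    replicas = effective_replicas(datanodes, replication)
    if replicas == 0:
        return 0.0
    return datanodes * disk_gb_per_node / replicas


if __name__ == "__main__":
    # Compare the planned 2-DataNode cluster with slightly larger ones.
    for nodes in (2, 3, 6):
        print(nodes, effective_replicas(nodes), usable_capacity_gb(nodes, 500.0))
```

Under these assumptions, a third DataNode is the smallest change that lets the default replication factor be met. Lowering dfs.replication to 2 is the other option.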

Re: New Production Cluster Criticisms/Advice

Posted by anil gupta <an...@gmail.com>.

My 2 cents on Hadoop version in Production:

If you think you will be deploying your stuff in prod in 1-2 months, then
you should note that cdh4 uses Hadoop-2.0.0-Alpha, and an "Alpha" release means
Hadoop-2.0.0 is not production ready. So you might need to make a call
on which cdh version to use (cdh3u3 or cdh4).

Personally, I have used both cdh3u2 and cdh4. Recently, I completed setting
up a fully distributed cluster of cdh4 with HA for NameNode, ZooKeeper, and
HBase Master. HA for NameNode is a big advantage with Hadoop-2.0.0.

HTH,
Anil Gupta

On Tue, Aug 14, 2012 at 6:36 PM, Eli Finkelshteyn <ie...@gmail.com> wrote:

> Hey Mohammad,
> Thanks for the reply. I've been using Hadoop and Pig for a while, and I've
> setup a pseudo-cluster before. I've just never setup anything
> production-scale yet and wanted advice on that.
>
> Cheers,
>
>
> On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <do...@gmail.com> wrote:
>
>> Hello Eli,
>>
>>     If this is your first time with Hadoop then I would suggest to
>> configure a cluster locally just to get yourself familiar with Hadoop(a
>> pseudo setup would do).
>>
>> For your analytical stuff you can have a look at Pig, another member of
>> the Hadoop ecosystem. It's a dataflow language that makes analytics really
>> easy.
>>
>> As a data store Hbase would definitely be a good move.
>>
>> For data aggregation, you can also have a look at Flume and Chukwa, apart
>> from Scribe.
>>
>> On Wednesday, August 15, 2012, Eli Finkelshteyn <ie...@gmail.com>
>> wrote:
>> > Hey Folks,
>> > I'm going to be setting up my first new production cluster soon, and
>> was hoping to get some advice and criticism on my current plan of action.
>> Here's my current plan:
>> > Background/Requirements:
>> > I'm setting this up for a start-up that's not gathering very big data
>> yet, but will be in the next few months (I hope, anyway). I'd like to use
>> the cluster for a few things, at least at first:
>> > 1. logging stuff it doesn't make sense to write to a normal database
>> (as well as duplicates of what I am throwing in my database so I can use
>> that stuff from HDFS later on). Basically, just logging a ton
>> of information I might want for analytics/model training later.
>> > 2. analytics processing.
>> > 3. model training (for machine learning). I'll primarily do this
>> through Mahout.
>> > 4. will probably want hbase on there as well for real time reading of
>> some data. I'm not married to this, and haven't played around much with
>> hbase yet, but wanted to leave the possibility open.
>> > The Plan:
>> > I'm thinking I'll set this up in Amazon. We have most of the rest of
>> our hardware there, and I really like the option to be able to spin up a
>> bunch of extra workers at will to have them train some ML model for me and
>> then kill them off. For now, just to get things off the ground, I'm going
>> to setup a small 4 machine cluster (1 NameNode, 1
>> SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
>> around with that setup, and will add more to it as needed. Since everything
>> will be puppetized, adding more machines shouldn't be too bad (I think).
>> I've been using Cloudera so far, and I haven't seen any good reason to
>> switch, so I'll use CDH4. For logging, I'll just use scribe and wind up
>> storing stuff as lzos (a good tutorial on the best way to do this would be
>> awesome).
>> > Thoughts?
>> > Eli
>>
>> --
>> Regards,
>>     Mohammad Tariq
>>
>>
>


-- 
Thanks & Regards,
Anil Gupta

Re: New Production Cluster Criticisms/Advice

Posted by Michael Segel <mi...@hotmail.com>.

Real clusters are a tad harder than the pseudo cluster.

You may want to consider EMR, where you can choose between Amazon's Hadoop release (it's Apache), MapR M3, or MapR M5.


On Aug 14, 2012, at 8:36 PM, Eli Finkelshteyn <ie...@gmail.com> wrote:

> Hey Mohammad,
> Thanks for the reply. I've been using Hadoop and Pig for a while, and I've setup a pseudo-cluster before. I've just never setup anything production-scale yet and wanted advice on that.
> 
> Cheers, 
> 
> On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <do...@gmail.com> wrote:
> Hello Eli,
> 
>     If this is your first time with Hadoop then I would suggest to configure a cluster locally just to get yourself familiar with Hadoop(a pseudo setup would do).
> 
> For your analytical stuff you can have a look at Pig, another member of the Hadoop ecosystem. It's a dataflow language that makes analytics really easy.
> 
> As a data store Hbase would definitely be a good move.
> 
> For data aggregation, you can also have a look at Flume and Chukwa, apart from Scribe. 
> 
> On Wednesday, August 15, 2012, Eli Finkelshteyn <ie...@gmail.com> wrote:
> > Hey Folks,
> > I'm going to be setting up my first new production cluster soon, and was hoping to get some advice and criticism on my current plan of action. Here's my current plan:
> > Background/Requirements:
> > I'm setting this up for a start-up that's not gathering very big data yet, but will be in the next few months (I hope, anyway). I'd like to use the cluster for a few things, at least at first:
> > 1. logging stuff it doesn't make sense to write to a normal database (as well as duplicates of what I am throwing in my database so I can use that stuff from HDFS later on). Basically, just logging a ton of information I might want for analytics/model training later.
> > 2. analytics processing.
> > 3. model training (for machine learning). I'll primarily do this through Mahout.
> > 4. will probably want hbase on there as well for real time reading of some data. I'm not married to this, and haven't played around much with hbase yet, but wanted to leave the possibility open.
> > The Plan:
> > I'm thinking I'll set this up in Amazon. We have most of the rest of our hardware there, and I really like the option to be able to spin up a bunch of extra workers at will to have them train some ML model for me and then kill them off. For now, just to get things off the ground, I'm going to setup a small 4 machine cluster (1 NameNode, 1 SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing around with that setup, and will add more to it as needed. Since everything will be puppetized, adding more machines shouldn't be too bad (I think). I've been using Cloudera so far, and I haven't seen any good reason to switch, so I'll use CDH4. For logging, I'll just use scribe and wind up storing stuff as lzos (a good tutorial on the best way to do this would be awesome).
> > Thoughts?
> > Eli
> 
> -- 
> Regards,
>     Mohammad Tariq
> 
>

Re: New Production Cluster Criticisms/Advice

Posted by Eli Finkelshteyn <ie...@gmail.com>.

Hey Mohammad,
Thanks for the reply. I've been using Hadoop and Pig for a while, and I've
set up a pseudo-cluster before. I've just never set up anything
production-scale and wanted advice on that.

Cheers,

On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Eli,
>
>     If this is your first time with Hadoop then I would suggest to
> configure a cluster locally just to get yourself familiar with Hadoop(a
> pseudo setup would do).
>
> For your analytical stuff you can have a look at Pig, another member of
> the Hadoop ecosystem. It's a dataflow language that makes analytics really
> easy.
>
> As a data store Hbase would definitely be a good move.
>
> For data aggregation, you can also have a look at Flume and Chukwa, apart
> from Scribe.
>
> On Wednesday, August 15, 2012, Eli Finkelshteyn <ie...@gmail.com>
> wrote:
> > Hey Folks,
> > I'm going to be setting up my first new production cluster soon, and was
> hoping to get some advice and criticism on my current plan of action.
> Here's my current plan:
> > Background/Requirements:
> > I'm setting this up for a start-up that's not gathering very big data
> yet, but will be in the next few months (I hope, anyway). I'd like to use
> the cluster for a few things, at least at first:
> > 1. logging stuff it doesn't make sense to write to a normal database (as
> well as duplicates of what I am throwing in my database so I can use that
> stuff from HDFS later on). Basically, just logging a ton of information I
> might want for analytics/model training later.
> > 2. analytics processing.
> > 3. model training (for machine learning). I'll primarily do this through
> Mahout.
> > 4. will probably want hbase on there as well for real time reading of
> some data. I'm not married to this, and haven't played around much with
> hbase yet, but wanted to leave the possibility open.
> > The Plan:
> > I'm thinking I'll set this up in Amazon. We have most of the rest of our
> hardware there, and I really like the option to be able to spin up a bunch
> of extra workers at will to have them train some ML model for me and then
> kill them off. For now, just to get things off the ground, I'm going to
> setup a small 4 machine cluster (1 NameNode, 1
> SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
> around with that setup, and will add more to it as needed. Since everything
> will be puppetized, adding more machines shouldn't be too bad (I think).
> I've been using Cloudera so far, and I haven't seen any good reason to
> switch, so I'll use CDH4. For logging, I'll just use scribe and wind up
> storing stuff as lzos (a good tutorial on the best way to do this would be
> awesome).
> > Thoughts?
> > Eli
>
> --
> Regards,
>     Mohammad Tariq
>
>

Re: New Production Cluster Criticisms/Advice

Posted by Eli Finkelshteyn <ie...@gmail.com>.

Hey Mohammad,
Thanks for the reply. I've been using Hadoop and Pig for a while, and I've
setup a pseudo-cluster before. I've just never setup anything
production-scale yet and wanted advice on that.

Cheers,

On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Eli,
>
>     If this is your first time with Hadoop, then I would suggest
> configuring a cluster locally just to get yourself familiar with Hadoop (a
> pseudo-distributed setup would do).
>
> For your analytical stuff you can have a look at Pig, another member of
> the Hadoop ecosystem. It's a dataflow language that makes analytics really
> easy.
>
> As a data store, HBase would definitely be a good move.
>
> For data aggregation, you can also have a look at Flume and Chukwa, apart
> from Scribe.
>
> On Wednesday, August 15, 2012, Eli Finkelshteyn <ie...@gmail.com>
> wrote:
> > Hey Folks,
> > I'm going to be setting up my first new production cluster soon, and was
> hoping to get some advice and criticism on my current plan of action.
> Here's my current plan:
> > Background/Requirements:
> > I'm setting this up for a start-up that's not gathering very big data
> yet, but will be in the next few months (I hope, anyway). I'd like to use
> the cluster for a few things, at least at first:
> > 1. logging stuff it doesn't make sense to write to a normal database (as
> well as duplicates of what I am throwing in my database so I can use that
> stuff from HDFS later on). Basically, just logging a ton of information I
> might want for analytics/model training later.
> > 2. analytics processing.
> > 3. model training (for machine learning). I'll primarily do this through
> Mahout.
> > 4. will probably want hbase on there as well for real time reading of
> some data. I'm not married to this, and haven't played around much with
> hbase yet, but wanted to leave the possibility open.
> > The Plan:
> > I'm thinking I'll set this up in Amazon. We have most of the rest of our
> hardware there, and I really like the option to be able to spin up a bunch
> of extra workers at will to have them train some ML model for me and then
> kill them off. For now, just to get things off the ground, I'm going to
> setup a small 4 machine cluster (1 NameNode, 1
> SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
> around with that setup, and will add more to it as needed. Since everything
> will be puppetized, adding more machines shouldn't be too bad (I think).
> I've been using Cloudera so far, and I haven't seen any good reason to
> switch, so I'll use CDH4. For logging, I'll just use scribe and wind up
> storing stuff as lzos (a good tutorial on the best way to do this would be
> awesome).
> > Thoughts?
> > Eli
>
> --
> Regards,
>     Mohammad Tariq
>
>

Re: New Production Cluster Criticisms/Advice

Posted by Mohammad Tariq <do...@gmail.com>.

Hello Eli,

    If this is your first time with Hadoop, then I would suggest
configuring a cluster locally just to get yourself familiar with Hadoop (a
pseudo-distributed setup would do).

For your analytical stuff you can have a look at Pig, another member of the
Hadoop ecosystem. It's a dataflow language that makes analytics really easy.

As a data store, HBase would definitely be a good move.

For data aggregation, you can also have a look at Flume and Chukwa, apart
from Scribe.

On Wednesday, August 15, 2012, Eli Finkelshteyn <ie...@gmail.com> wrote:
> Hey Folks,
> I'm going to be setting up my first new production cluster soon, and was
hoping to get some advice and criticism on my current plan of action.
Here's my current plan:
> Background/Requirements:
> I'm setting this up for a start-up that's not gathering very big data
yet, but will be in the next few months (I hope, anyway). I'd like to use
the cluster for a few things, at least at first:
> 1. logging stuff it doesn't make sense to write to a normal database (as
well as duplicates of what I am throwing in my database so I can use that
stuff from HDFS later on). Basically, just logging a ton of information I
might want for analytics/model training later.
> 2. analytics processing.
> 3. model training (for machine learning). I'll primarily do this through
Mahout.
> 4. will probably want hbase on there as well for real time reading of
some data. I'm not married to this, and haven't played around much with
hbase yet, but wanted to leave the possibility open.
> The Plan:
> I'm thinking I'll set this up in Amazon. We have most of the rest of our
hardware there, and I really like the option to be able to spin up a bunch
of extra workers at will to have them train some ML model for me and then
kill them off. For now, just to get things off the ground, I'm going to
setup a small 4 machine cluster (1 NameNode, 1
SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
around with that setup, and will add more to it as needed. Since everything
will be puppetized, adding more machines shouldn't be too bad (I think).
I've been using Cloudera so far, and I haven't seen any good reason to
switch, so I'll use CDH4. For logging, I'll just use scribe and wind up
storing stuff as lzos (a good tutorial on the best way to do this would be
awesome).
> Thoughts?
> Eli

-- 
Regards,
    Mohammad Tariq
