Posted to user@hadoop.apache.org by Deepak Goel <de...@gmail.com> on 2016/05/27 15:54:28 UTC

Reliability of Hadoop

Hey

Namaskara~Nalama~Guten Tag~Bonjour

We are yet to see any server go down among our cluster nodes in the production
environment. Has anyone seen reliability problems in their production
environment? How many times?

Thanks
Deepak
   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

Re: Reliability of Hadoop

Posted by Deepak Goel <de...@gmail.com>.
My thoughts are inline. (Sorry for my poor English earlier; I will try to
explain what I mean again.)


On Sat, May 28, 2016 at 1:17 AM, Dejan Menges <de...@gmail.com>
 wrote:

> Hi Deepak,
>
> Hadoop is just a platform (Hadoop and everything around it), a toolset to do
> what you want to do.
>
> If you are writing bad code, you can't blame the programming language; it's
> you not being able to write good code. There's also nothing bad in using
> commodity hardware (and I'm not sure I understand what "commodity software"
> means). At this very moment, while we are exchanging these emails, how much
> do we know or care about which hardware the mail servers run on? We don't,
> nor do we care.
>

**********************Deepak**********************
I am not saying Hadoop is bad at all. In fact, I am saying it is wonderful.
However, the algorithms written over the past decades (in the OS, the JVM, our
applications) are perhaps not the best in terms of performance. In fact, they
seem to be governed by an "inverse Moore's law", something like: "The
performance of software halves every year." Now, with the coming of Hadoop,
algorithms are run in parallel on many small computers, and they don't have to
be EFFICIENT at all. So in all likelihood, the quality of our algorithms in the
OS, the JVM, and applications (not Hadoop!) will decrease further. As we are
all from the software industry, we must guard ourselves against this pitfall.

As to your question of whether we care about the hardware the mail servers run
on: it is subjective and person dependent. From my perspective, I do care (and
I might be writing this to satisfy my ego!). For example, I do keep thinking
about which servers our software runs on. Which CPU? What is the technology
inside the CPU? How much heat is the CPU generating? How much smaller and
faster can the CPU get? And so on...
********************** Deepak**********************


> For whitepapers and use cases, the internet is full of them.
>
**********************Deepak**********************
I tried googling (assuming Google is good at its job of finding what I need)
and could not find any whitepapers on the cost vs. performance benefit between
the two technologies. Can you please provide a link if you have one?

**********************Deepak**********************

>
> My company keeps the majority of its really important data in the Hadoop
> ecosystem. Some of the best software developers I have met so far are writing
> different types of code for it, from analytics to development of in-house
> software and plugins for different things.
>

**********************Deepak**********************
I will draw inspiration from your example above and improve my software
skills (actually, I am almost new to software development and am starting
from scratch). Thank you, I appreciate it. :)

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"


Re: Reliability of Hadoop

Posted by "Sudhir.Kumar" <Su...@target.com>.
Deepak,
Let's stop this discussion here.
I can only say your analogy is out of place. Civilizations did not fall because the people in them became lazy or incompetent. You only know the history that is told to you, and you can never be sure that what you are told is factually correct.


Re: Reliability of Hadoop

Posted by Deepak Goel <de...@gmail.com>.
I hope you are right and I am wrong. However, history has shown otherwise (not
particularly in the software industry, which is still young). Perhaps we should
revisit this conversation ten years from now and see what the truth is...

For example: "When a country or community becomes strong or efficient, its
people tend to become inefficient and lazy." Civilizations rise (the platform),
then become complacent (the folks writing the algorithms), and then fall...

Sorry for using such abstruse examples! I think I should be evicted from this
group (as most of my questions are not technical questions about Hadoop)...




Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"


Re: Reliability of Hadoop

Posted by "Sudhir.Kumar" <Su...@target.com>.
Hi Deepak,

Your assumption that folks would write bad algorithms because a platform is efficient is wrong and misguided. A wrong algorithm on an efficient platform would still bite you.


Re: Reliability of Hadoop

Posted by Deepak Goel <de...@gmail.com>.
My thoughts are inline. (Sorry for my poor English earlier; I will try to
explain what I mean again.)


On Sat, May 28, 2016 at 1:17 AM, Dejan Menges <de...@gmail.com>
wrote:

> Hi Deepak,
>
> Hadoop is just a platform (Hadoop and everything around it), a toolset to do
> what you want to do.
>
> If you are writing bad code, you can't blame the programming language; it's
> you not being able to write good code. There's also nothing bad in using
> commodity hardware (and I'm not sure I understand what "commodity software"
> means). At this very moment, while we are exchanging these emails, how much
> do we know or care about which hardware the mail servers run on? We don't,
> nor do we care.
>

**********************Deepak**********************
I am not saying Hadoop is bad at all. In fact, I am saying it is wonderful.
However, the algorithms written over the past decades (in the OS, the JVM, our
applications) are perhaps not the best in terms of performance. In fact, they
seem to be governed by an "inverse Moore's law", something like: "The
performance of software halves every year." Now, with the coming of Hadoop,
algorithms are run in parallel on many small computers, and they don't have to
be efficient at all. So in all likelihood, the quality of our algorithms in the
OS, the JVM, and applications (not Hadoop!) will decrease further. As we are
all from the software industry, we must guard ourselves against this pitfall.

As to your question of whether we care about the hardware the mail servers run
on: it is subjective and person dependent. From my perspective, I do care (and
I might be writing this to satisfy my ego!). For example, I do keep thinking
about which servers our software runs on. Which CPU? What is the technology
inside the CPU? How much heat is the CPU generating? How much smaller and
faster can the CPU get? And so on...
********************** Deepak**********************


> For whitepapers and use cases, the internet is full of them.
>
**********************Deepak**********************
I tried googling (assuming Google is good at its job of finding what I need)
and could not find any whitepapers on the cost vs. performance benefit between
the two technologies. Can you please provide a link if you have one?

**********************Deepak**********************

>
> My company keeps the majority of its really important data in the Hadoop
> ecosystem. Some of the best software developers I have met so far are writing
> different types of code for it, from analytics to development of in-house
> software and plugins for different things.
>

**********************Deepak**********************
I will draw inspiration from your example above and improve my software
skills (actually, I am almost new to software development and am starting
from scratch). Thank you, I appreciate it. :)

**********************Deepak**********************


Re: Reliability of Hadoop

Posted by Dejan Menges <de...@gmail.com>.
Hi Deepak,

Hadoop is just a platform (Hadoop and everything around it), a toolset to do
what you want to do.

If you are writing bad code, you can't blame the programming language; it's
you not being able to write good code. There's also nothing bad in using
commodity hardware (and I'm not sure I understand what "commodity software"
means). At this very moment, while we are exchanging these emails, how much do
we know or care about which hardware the mail servers run on? We don't, nor do
we care.

For whitepapers and use cases, the internet is full of them.

My company keeps the majority of its really important data in the Hadoop
ecosystem. Some of the best software developers I have met so far are writing
different types of code for it, from analytics to development of in-house
software and plugins for different things.

However, I'm not sure that anyone on any mailing list can give you the answers
that you need. I would start with the official documentation and with
understanding how each specific component works in depth and why it works the
way it works.

My 2c

Cheers,
Dejan


Re: Reliability of Hadoop

Posted by Deepak Goel <de...@gmail.com>.
Sorry once again if I am wrong, or if my comments are without significance.

I am not saying Hadoop is bad or good... It is just that Hadoop might be
indirectly encouraging commodity hardware and software to be developed, which
is convenient but might not be very good (also, the cost benefit is unproven,
with no proper case studies or whitepapers).

It is like the fast-food industry, which is very convenient (a commodity) but
is causing obesity all over the world (and hence also causing many illnesses,
poor health, and social trauma, so the true cost of a burger to anyone is
actually far more than what a company charges when you eat it).

In effect, what Hadoop (and all the other commercial software around it) is
saying is that it's OK if you have bad software (application, JVM, OS); I will
provide another piece of software that will hide all your problems... We might
all just go the obesity way in the software industry too.



Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"


Re: Reliability of Hadoop

Posted by "J. Rottinghuis" <jr...@gmail.com>.
We run several clusters of thousands of nodes (as do many companies); our
largest one has over 10K nodes. Disks, machines, memory, and network fail
all the time. The larger the scale, the higher the odds that some machine
is bad on a given day. On the other hand, scale helps. If a single node out
of 10K fails, 9,999 others participate in re-distributing state. Even a
rack failure isn't a big deal most of the time (plus, typically a rack fails
due to a top-of-rack (TOR) switch issue, so the data is offline, but
typically not lost permanently).

Hadoop is designed to deal with this, and by-and-large it does. Critical
components (such as Namenodes) can be configured to run in an HA pair with
automatic failover. There is quite a bit of work going on by many in the
Hadoop community to keep pushing the boundaries of scale.
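
As a rough illustration of what such an HA setup amounts to (a minimal sketch,
not taken from any real cluster; the nameservice id "mycluster", the hostnames,
and the ports below are made up), the relevant HDFS settings can be expressed
programmatically like this:

import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        // One logical nameservice ("mycluster") backed by two NameNodes, nn1 and nn2.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Clients fail over between the two NameNodes through this proxy provider.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Automatic failover is coordinated through ZooKeeper (ZKFC processes).
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}

In a real deployment the same keys would normally live in hdfs-site.xml and
core-site.xml rather than be set in code.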

A node or a rack failing in a large cluster actually has less impact than
at smaller scale. With a 5-node cluster, if 1 machine crashes you've taken
20% of capacity (disk and compute) offline. 1 out of 1K barely registers.
Ditto with a 3-rack cluster: lose a rack and 1/3rd of your capacity is
offline.
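
Just to make that arithmetic concrete (a toy sketch of the numbers above,
nothing Hadoop-specific), the fraction of capacity lost is simply failed units
over total units:

public class FailureImpactSketch {
    // Fraction of cluster capacity taken offline when failedUnits out of
    // totalUnits (nodes or racks) are down.
    static double fractionOffline(int failedUnits, int totalUnits) {
        return (double) failedUnits / totalUnits;
    }

    public static void main(String[] args) {
        System.out.printf("1 of 5 nodes down:    %.1f%%%n", 100 * fractionOffline(1, 5));    // 20.0%
        System.out.printf("1 of 1000 nodes down: %.1f%%%n", 100 * fractionOffline(1, 1000)); // 0.1%
        System.out.printf("1 of 3 racks down:    %.1f%%%n", 100 * fractionOffline(1, 3));    // 33.3%
    }
}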

It is large-scale coordinated failure you should worry about. Think several
rows of racks coming offline due to power failure, a DC going offline due
to a fire in the building, etc. Those are hard to deal with in software
within a single DC. They should also be rarer, but as many companies have
experienced, large-scale coordinated failures do occasionally happen.

As to your question in the other email thread, it is a well-established
pattern that scaling horizontally with commodity hardware (and letting
software such as Hadoop deal with failures) helps with both scale and
reducing cost.

Cheers,

Joep



Re: Reliability of Hadoop

Posted by Arun Natva <ar...@gmail.com>.
Deepak,
I have managed clusters where worker nodes crashed and disks failed.
HDFS takes care of the data replication unless you lose so many of the nodes that there is not enough space to fit the replicas.
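
For what it's worth, here is a minimal sketch of checking and adjusting
replication from the HDFS Java API (the file path and the replication factor
of 3 are made-up examples; the configuration is assumed to come from the usual
core-site.xml/hdfs-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads site configs from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events/part-00000"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication = " + status.getReplication());

        // Ask the NameNode to keep 3 copies; it re-replicates automatically
        // when DataNodes holding replicas die, as long as space allows.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}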



Sent from my iPhone
