Posted to user@hadoop.apache.org by Mohammad Tariq <do...@gmail.com> on 2012/11/28 23:09:19 UTC

Guidelines for production cluster

Hello list,

     Although a lot of similar discussions have taken place here, I still
seek some of your able guidance. Till now I have worked only on small or
mid-sized clusters, but this time the situation is a bit different. I have to
collect a lot of legacy data, stored over the last few decades. This data is
on tape drives, and I have to collect it from there and store it in my
cluster. The size could go somewhere near 24 petabytes (inclusive of
replication).
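
For a sense of scale, here is my rough back-of-envelope sketch (assuming the
default 3x replication and a hypothetical 12 x 4 TB DataNode build; our real
hardware is not decided yet):

# Back-of-envelope sizing sketch; assumptions only, not a recommendation.
RAW_PB = 24.0            # total on-disk footprint, replicas included
REPLICATION = 3          # assumed HDFS replication factor
logical_pb = RAW_PB / REPLICATION                 # ~8 PB of unique data

DISKS_PER_NODE = 12      # hypothetical DataNode: 12 x 4 TB disks
TB_PER_DISK = 4.0
USABLE_FRACTION = 0.75   # headroom for OS, logs, MR spill, non-DFS use
node_tb = DISKS_PER_NODE * TB_PER_DISK * USABLE_FRACTION   # 36 TB per node

datanodes = RAW_PB * 1000 / node_tb               # PB -> TB
print(f"~{logical_pb:.0f} PB unique data, ~{datanodes:.0f} DataNodes")
# -> ~8 PB unique data, ~667 DataNodes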

Now, I need some help to kick this off, like what could be the optimal
config for my NN+JT, DN+TT+RS, HMaster, ZK machines?

What should be the number of slave nodes and ZK peers, keeping this config in
mind?
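
(My understanding is that the ZK ensemble size is driven by quorum math rather
than cluster size: an ensemble of n peers tolerates (n - 1) // 2 failures,
which is why 3 or 5 peers is typical. A quick check:)

# ZooKeeper quorum math: a majority of peers must stay alive.
for n in (1, 3, 5, 7):
    print(f"{n} peers -> quorum {n // 2 + 1}, tolerates {(n - 1) // 2} failures")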

What is the optimal network config for a cluster of this size?
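
(Rough math on why the network worries me, ignoring protocol overhead and
assuming the full replicated footprint crosses the cluster fabric during
ingest:)

# Time to land 24 PB at various aggregate ingest rates.
TOTAL_BYTES = 24 * 10**15
for gbps in (10, 40, 100):            # aggregate bandwidth in Gbit/s
    days = TOTAL_BYTES * 8 / (gbps * 10**9) / 86400
    print(f"{gbps:>3} Gbit/s -> ~{days:.0f} days")
# 10 Gbit/s -> ~222 days; 100 Gbit/s -> ~22 days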

Which kind of disks would be more efficient?

Please do provide me some guidance as I want to have some expert comments
before moving ahead. Many thanks.

Regards,
    Mohammad Tariq

Re: Guidelines for production cluster

Posted by Mohammad Tariq <do...@gmail.com>.
Thanks again Gaurav. At least one file will be read at a time. The file is the
atomic unit, and each of these binary files can go up to 1 TB in size.
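
Since the records are fixed-length with no separators, a reader should only
need the record size to split the stream. A minimal sketch, assuming a
hypothetical 4 KB record (ours is fixed but not yet finalized):

RECORD_SIZE = 4096  # hypothetical record length in bytes

def read_records(stream, record_size=RECORD_SIZE):
    """Yield fixed-length records from a binary stream; no separators needed."""
    while True:
        rec = stream.read(record_size)
        if not rec:
            return
        if len(rec) < record_size:
            raise ValueError("truncated trailing record")
        yield rec

# usage:
# with open("part-00000.bin", "rb") as f:
#     for rec in read_records(f):
#         process(rec)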

Latency is not a major concern, since almost 99.99% of the stuff would be
offline and will involve batch processing. So, we can compromise there.

Thank you for the pointer. I'll definitely have a look over it.

Regards,
    Mohammad Tariq



On Fri, Nov 30, 2012 at 4:37 AM, Gaurav Sharma
<ga...@gmail.com> wrote:

> The 7th question should've been asked first, as it rather obviates the need
> for some of the other 6. So, if the data is binary, MR is of little use
> anyway. Didn't understand, and likely don't believe, this statement:
>
> "No, entire data is equally important and will be read together."
>
> Other than that, an 8th question:
> 8. how much read latency can the system tolerate?
>
> and a 9th:
> 9. what is the usable size of a unit of data being read? it being binary,
> does the entire stream have to be read to make sense of it for the
> application, or are parts of the binary usable?
>
>
> If you can get away with some read-latency, take a look at one of the
> commercial erasure coding solutions out there (like Cleversafe) or just
> code one yourself. Also, see:
> https://issues.apache.org/jira/browse/HDFS-503
>
> hth
>
>
>
> On Thu, Nov 29, 2012 at 2:19 AM, Mohammad Tariq <do...@gmail.com> wrote:
>
>> Hello Gaurav,
>>
>>     Thank you so much for your reply. Please find my comments embedded
>> below:
>>
>> 1. do you know if there exist patterns in this data?
>> >> Yes, the entire file is divided into fixed-length data blocks (but
>> there is no separator between two blocks).
>>
>> 2. will the data be read and how?
>> >> Yes, data has to be read. To be honest, we are still not sure how to
>> do that.
>>
>> 3. does there exist a hot subset of the data - both read/write?
>> >> No, entire data is equally important and will be read together.
>>
>> 4. what makes you think hdfs is a good option?
>> >> Distributed architecture, flexibility to read any kind of data,
>> parallelism, native MR integration, cost, fault tolerance, high
>> throughput, etc.
>>
>> 5. how much do you intend to pay per TB?
>> >> I have to discuss it with my superiors (Will let you know soon).
>>
>> 6. say you do build the system, how do you plan to keep lights on?
>> >> I am sorry, I did not get this. I mean I'll do whatever it takes to
>> keep everything moving. I have some experience with small clusters, and I
>> have got a small team with me that is ready 24x7.
>>
>> 7. forgot to ask - is the data textual or binary?
>> >> Data is binary.
>>
>> No, I would require some help. I have a team with me, as I have said, but
>> being new to Hadoop I would need some help from whatever source is
>> available.
>>
>> Many thanks.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Thu, Nov 29, 2012 at 5:40 AM, Gaurav Sharma <
>> gaurav.gs.sharma@gmail.com> wrote:
>>
>>> So, before getting any suggestions, you will have to explain a few core
>>> things:
>>>
>>> 1. do you know if there exist patterns in this data?
>>> 2. will the data be read and how?
>>> 3. does there exist a hot subset of the data - both read/write?
>>> 4. what makes you think hdfs is a good option?
>>> 5. how much do you intend to pay per TB?
>>> 6. say you do build the system, how do you plan to keep lights on?
>>> 7. forgot to ask - is the data textual or binary?
>>>
>>> Those are just the basic questions. Are you going to be building and
>>> running the system all by yourself?
>>>
>>>
>>> On Nov 28, 2012, at 14:09, Mohammad Tariq <do...@gmail.com> wrote:
>>>
>>> > Hello list,
>>> >
>>> >      Although a lot of similar discussions have taken place here, I
>>> still seek some of your able guidance. Till now I have worked only on
>>> small or mid-sized clusters, but this time the situation is a bit
>>> different. I have to collect a lot of legacy data, stored over the last
>>> few decades. This data is on tape drives, and I have to collect it from
>>> there and store it in my cluster. The size could go somewhere near 24
>>> petabytes (inclusive of replication).
>>> >
>>> > Now, I need some help to kick this off, like what could be the optimal
>>> config for my NN+JT, DN+TT+RS,  HMaster, ZK machines?
>>> >
>>> > What should be the number of slave nodes and ZK peers, keeping this
>>> config in mind?
>>> >
>>> > What is the optimal network config for a cluster of this size?
>>> >
>>> > Which kind of disks would be more efficient?
>>> >
>>> > Please do provide me some guidance as I want to have some expert
>>> comments before moving ahead. Many thanks.
>>> >
>>> > Regards,
>>> >     Mohammad Tariq
>>> >
>>>
>>
>>
>

Re: Guidelines for production cluster

Posted by Gaurav Sharma <ga...@gmail.com>.
The 7th question should've been asked first, as it rather obviates the need
for some of the other 6. So, if the data is binary, MR is of little use anyway.
Didn't understand, and likely don't believe, this statement:
"No, entire data is equally important and will be read together."

Other than that, an 8th question:
8. how much read latency can the system tolerate?

and a 9th:
9. what is the usable size of a unit of data being read? it being binary,
does the entire stream have to be read to make sense of it for the
application, or are parts of the binary usable?


If you can get away with some read-latency, take a look at one of the
commercial erasure coding solutions out there (like Cleversafe) or just
code one yourself. Also, see: https://issues.apache.org/jira/browse/HDFS-503
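
To see the appeal at this scale, compare on-disk footprints; a sketch, where
the Reed-Solomon (10,4) layout is just an example, not a recommendation:

# 3x replication vs an example Reed-Solomon (10,4) erasure code.
logical_pb = 8.0                   # unique data, from the 24 PB / 3x figure
replicated_pb = logical_pb * 3                                 # 24.0 PB
data_blk, parity_blk = 10, 4                                   # example RS(10,4)
erasure_pb = logical_pb * (data_blk + parity_blk) / data_blk   # 11.2 PB
print(f"3x replication: {replicated_pb:.1f} PB on disk")
print(f"RS(10,4):       {erasure_pb:.1f} PB on disk")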

hth


On Thu, Nov 29, 2012 at 2:19 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Gaurav,
>
>     Thank you so much for your reply. Please find my comments embedded
> below:
>
> 1. do you know if there exist patterns in this data?
> >> Yes, the entire file is divided into fixed-length data blocks (but there
> is no separator between two blocks).
>
> 2. will the data be read and how?
> >> Yes, data has to be read. To be honest, we are still not sure how to do
> that.
>
> 3. does there exist a hot subset of the data - both read/write?
> >> No, entire data is equally important and will be read together.
>
> 4. what makes you think hdfs is a good option?
> >> Distributed architecture, flexibility to read any kind of data,
> parallelism, native MR integration, cost, fault tolerance, high
> throughput, etc.
>
> 5. how much do you intend to pay per TB?
> >> I have to discuss it with my superiors (Will let you know soon).
>
> 6. say you do build the system, how do you plan to keep lights on?
> >> I am sorry, I did not get this. I mean I'll do whatever it takes to keep
> everything moving. I have some experience with small clusters, and I have
> got a small team with me that is ready 24x7.
>
> 7. forgot to ask - is the data textual or binary?
> >> Data is binary.
>
> No, I would require some help. I have a team with me, as I have said, but
> being new to Hadoop I would need some help from whatever source is available.
>
> Many thanks.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Thu, Nov 29, 2012 at 5:40 AM, Gaurav Sharma <gaurav.gs.sharma@gmail.com
> > wrote:
>
>> So, before getting any suggestions, you will have to explain a few core
>> things:
>>
>> 1. do you know if there exist patterns in this data?
>> 2. will the data be read and how?
>> 3. does there exist a hot subset of the data - both read/write?
>> 4. what makes you think hdfs is a good option?
>> 5. how much do you intend to pay per TB?
>> 6. say you do build the system, how do you plan to keep lights on?
>> 7. forgot to ask - is the data textual or binary?
>>
>> Those are just the basic questions. Are you going to be building and
>> running the system all by yourself?
>>
>>
>> On Nov 28, 2012, at 14:09, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> > Hello list,
>> >
>> >      Although a lot of similar discussions have taken place here, I
>> still seek some of your able guidance. Till now I have worked only on small
>> or mid-sized clusters, but this time the situation is a bit different. I
>> have to collect a lot of legacy data, stored over the last few decades. This
>> data is on tape drives, and I have to collect it from there and store it in
>> my cluster. The size could go somewhere near 24 petabytes (inclusive of
>> replication).
>> >
>> > Now, I need some help to kick this off, like what could be the optimal
>> config for my NN+JT, DN+TT+RS,  HMaster, ZK machines?
>> >
>> > What should be the number of slave nodes and ZK peers, keeping this
>> config in mind?
>> >
>> > What is the optimal network config for a cluster of this size?
>> >
>> > Which kind of disks would be more efficient?
>> >
>> > Please do provide me some guidance as I want to have some expert
>> comments before moving ahead. Many thanks.
>> >
>> > Regards,
>> >     Mohammad Tariq
>> >
>>
>
>

Re: Guidelines for production cluster

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Gaurav,

    Thank you so much for your reply. Please find my comments embedded
below:

1. do you know if there exist patterns in this data?
>> Yes, the entire file is divided into fixed-length data blocks (but there
is no separator between two blocks).
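
(Side note: if we choose an HDFS block size that is an exact multiple of the
record length, no record will straddle a block boundary, and each block can be
processed independently. A quick sanity check, assuming a hypothetical 4 KB
record and a 128 MB block size:)

RECORD = 4096                    # hypothetical fixed record length, bytes
HDFS_BLOCK = 128 * 1024 * 1024   # e.g. a 128 MB dfs.block.size
assert HDFS_BLOCK % RECORD == 0, "pick a block size that is a record multiple"
print(HDFS_BLOCK // RECORD, "whole records per block")   # 32768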

2. will the data be read and how?
>> Yes, data has to be read. To be honest, we are still not sure how to do
that.

3. does there exist a hot subset of the data - both read/write?
>> No, entire data is equally important and will be read together.

4. what makes you think hdfs is a good option?
>> Distributed architecture, flexibility to read any kind of data,
parallelism, native MR integration, cost, fault tolerance, high
throughput, etc.

5. how much do you intend to pay per TB?
>> I have to discuss it with my superiors (Will let you know soon).

6. say you do build the system, how do you plan to keep lights on?
>> I am sorry, I did not get this. I mean I'll do whatever it takes to keep
everything moving. I have some experience with small clusters, and I have
got a small team with me that is ready 24x7.

7. forgot to ask - is the data textual or binary?
>> Data is binary.

No, I would require some help. I have a team with me, as I have said, but
being new to Hadoop I would need some help from whatever source is available.

Many thanks.

Regards,
    Mohammad Tariq



On Thu, Nov 29, 2012 at 5:40 AM, Gaurav Sharma
<ga...@gmail.com> wrote:

> So, before getting any suggestions, you will have to explain a few core things:
>
> 1. do you know if there exist patterns in this data?
> 2. will the data be read and how?
> 3. does there exist a hot subset of the data - both read/write?
> 4. what makes you think hdfs is a good option?
> 5. how much do you intend to pay per TB?
> 6. say you do build the system, how do you plan to keep lights on?
> 7. forgot to ask - is the data textual or binary?
>
> Those are just the basic questions. Are you going to be building and
> running the system all by yourself?
>
>
> On Nov 28, 2012, at 14:09, Mohammad Tariq <do...@gmail.com> wrote:
>
> > Hello list,
> >
> >      Although a lot of similar discussions have taken place here, I still
> seek some of your able guidance. Till now I have worked only on small or
> mid-sized clusters, but this time the situation is a bit different. I have
> to collect a lot of legacy data, stored over the last few decades. This data
> is on tape drives, and I have to collect it from there and store it in my
> cluster. The size could go somewhere near 24 petabytes (inclusive of
> replication).
> >
> > Now, I need some help to kick this off, like what could be the optimal
> config for my NN+JT, DN+TT+RS,  HMaster, ZK machines?
> >
> > What should be the number of slave nodes and ZK peers, keeping this
> config in mind?
> >
> > What is the optimal network config for a cluster of this size?
> >
> > Which kind of disks would be more efficient?
> >
> > Please do provide me some guidance as I want to have some expert
> comments before moving ahead. Many thanks.
> >
> > Regards,
> >     Mohammad Tariq
> >
>

Re: Guidelines for production cluster

Posted by Gaurav Sharma <ga...@gmail.com>.
So, before getting any suggestions, you will have to explain a few core things:

1. do you know if there exist patterns in this data?
2. will the data be read and how?
3. does there exist a hot subset of the data - both read/write?
4. what makes you think hdfs is a good option?
5. how much do you intend to pay per TB?
6. say you do build the system, how do you plan to keep lights on?
7. forgot to ask - is the data textual or binary?

Those are just the basic questions. Are you going to be building and running the system all by yourself?


On Nov 28, 2012, at 14:09, Mohammad Tariq <do...@gmail.com> wrote:

> Hello list,
> 
>      Although a lot of similar discussions have taken place here, I still seek some of your able guidance. Till now I have worked only on small or mid-sized clusters, but this time the situation is a bit different. I have to collect a lot of legacy data, stored over the last few decades. This data is on tape drives, and I have to collect it from there and store it in my cluster. The size could go somewhere near 24 petabytes (inclusive of replication).
> 
> Now, I need some help to kick this off, like what could be the optimal config for my NN+JT, DN+TT+RS,  HMaster, ZK machines? 
> 
> What should be the number of slave nodes and ZK peers, keeping this config in mind?
> 
> What is the optimal network config for a cluster of this size?
> 
> Which kind of disks would be more efficient?
> 
> Please do provide me some guidance as I want to have some expert comments before moving ahead. Many thanks.
> 
> Regards,
>     Mohammad Tariq
> 
