Posted to common-user@hadoop.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/01/18 13:11:20 UTC

Estimating disk space requirements

Hello,

I was estimating how much disk space I need for my cluster.

I have 24 million JSON documents, approx. 5 KB each.
The JSON is to be stored in HBase with some identifying data in columns,
and I also want to store the JSON for later retrieval based on the ID data
used as keys in HBase.
I have my HDFS replication factor set to 3.
Each node has Hadoop, HBase and Ubuntu installed on it, so approx. 11 GB
is available for use on my 20 GB node.

I am not sure whether, if I have not enabled HBase replication, the HDFS
replication alone is enough to keep the data safe and redundant, and how
much total disk space I will need for storing the data.

Please help me estimate this.

Thank you so much.

-- 
Regards,
Ouch Whisper
010101010101

Re: Estimating disk space requirements

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi,

some comments are inline in your message below ...


2013/1/18 Panshul Whisper <ou...@gmail.com>

> Hello,
>
> I was estimating how much disk space do I need for my cluster.
>
> I have 24 million JSON documents approx. 5kb each
> the Json is to be stored into HBASE with some identifying data in coloumns
> and I also want to store the Json for later retrieval based on the Id data
> as keys in Hbase.
> I have my HDFS replication set to 3
> each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11
> GB is available for use on my 20 GB node.
>

11 GB is quite small - or is there a typo?

The amount of raw data is about 115 GB:

    nr of items:       24 x 1.00E+06
    size of an item:    5 x 1.02E+03 bytes (~5 KB)
    total:             122,880,000,000 bytes = ~114.44 GB

(without additional key and metadata)

Depending on the amount of overhead this could be about 200 GB, times 3 for
replication, i.e. roughly 600 GB just for distributed storage.

And then you need some capacity to store intermediate processing data;
reserving 20% to 30% of the processed data size is recommended.

So you should plan for a capacity of 1 TB, or even more if your dataset grows.
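
As a rough back-of-the-envelope check, here is a minimal Python sketch of
the same estimate. The 75% key/metadata overhead and the 25% allowance for
intermediate data are assumptions you should adjust for your actual schema,
not measured values:

    # hedged capacity estimate for the numbers discussed in this thread
    docs = 24 * 1000 * 1000             # number of JSON documents
    doc_bytes = 5 * 1024                # ~5 KB per document
    replication = 3                     # HDFS dfs.replication

    raw_gb = docs * doc_bytes / float(1024 ** 3)    # ~114 GB of raw JSON
    with_overhead_gb = raw_gb * 1.75                # assumed key/metadata overhead
    replicated_gb = with_overhead_gb * replication  # ~600 GB on disk
    total_gb = replicated_gb * 1.25                 # assumed 25% for intermediate data

    print("raw = %.0f GB, replicated = %.0f GB, plan for ~%.0f GB"
          % (raw_gb, replicated_gb, total_gb))
    # -> raw = 114 GB, replicated = 601 GB, plan for ~751 GB, so ~1 TB is a safe target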


>
>

> I have no idea, if I have not enabled Hbase replication, is the HDFS
> replication enough to keep the data safe and redundant.
>

The replication on the HDFS level is sufficient for keeping the data safe,
no need to replicate the HBase tables separately.
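
For reference, the HDFS-level replication meant here is the dfs.replication
setting in hdfs-site.xml; shown only as an illustration, since you already
have it set to 3:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

HBase's own replication feature is for shipping edits to a second cluster,
which is a separate concern from in-cluster redundancy.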


>  How much total disk space I will need for the storage of the data.
>


>
> Please help me estimate this.
>
> Thank you so much.
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Best wishes
Mirko

Re: Estimating disk space requirements

Posted by Mohammad Tariq <do...@gmail.com>.
I have been using AWS for quite some time and I have
never faced any issue. Personally speaking, I have found AWS
really flexible; you get a great deal of freedom in choosing
services depending upon your requirements.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Jan 18, 2013 at 7:54 PM, Panshul Whisper <ou...@gmail.com> wrote:

> Thank you for the reply.
>
> It will be great if someone can suggest, if setting up my cluster on
> Rackspace is good or on Amazon using EC2 servers?
> keeping in mind Amazon services have been having a lot of downtimes...
> My main point of concern is performance and availablitiy.
> My cluster has to be very Highly Available.
>
> Thanks.
>
>
> On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> It all depend what you want to do with this data and the power of each
>> single node. There is no one size fit all rule.
>>
>> The more nodes you have, the more CPU power you will have to process
>> the data... But if you 80GB boxes CPUs are faster than your 40GB boxes
>> CPU ,maybe you should take the 80GB then.
>>
>> If you want to get better advices from the list, you will need to
>> beter define you needs and the nodes you can have.
>>
>> JM
>>
>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> > If we look at it with performance in mind,
>> > is it better to have 20 Nodes with 40 GB HDD
>> > or is it better to have 10 Nodes with 80 GB HDD?
>> >
>> > they are connected on a gigabit LAN
>> >
>> > Thnx
>> >
>> >
>> > On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> >> 20 nodes with 40 GB will do the work.
>> >>
>> >> After that you will have to consider performances based on your access
>> >> pattern. But that's another story.
>> >>
>> >> JM
>> >>
>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> >> > Thank you for the replies,
>> >> >
>> >> > So I take it that I should have atleast 800 GB on total free space on
>> >> > HDFS.. (combined free space of all the nodes connected to the
>> cluster).
>> >> So
>> >> > I can connect 20 nodes having 40 GB of hdd on each node to my
>> cluster.
>> >> Will
>> >> > this be enough for the storage?
>> >> > Please confirm.
>> >> >
>> >> > Thanking You,
>> >> > Regards,
>> >> > Panshul.
>> >> >
>> >> >
>> >> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>> >> > jean-marc@spaggiari.org> wrote:
>> >> >
>> >> >> Hi Panshul,
>> >> >>
>> >> >> If you have 20 GB with a replication factor set to 3, you have only
>> >> >> 6.6GB available, not 11GB. You need to divide the total space by the
>> >> >> replication factor.
>> >> >>
>> >> >> Also, if you store your JSon into HBase, you need to add the key
>> size
>> >> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>> >> >>
>> >> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
>> to
>> >> >> store it. Without including the key size. Even with a replication
>> >> >> factor set to 5 you don't have the space.
>> >> >>
>> >> >> Now, you can add some compression, but even with a lucky factor of
>> 50%
>> >> >> you still don't have the space. You will need something like 90%
>> >> >> compression factor to be able to store this data in your cluster.
>> >> >>
>> >> >> A 1T drive is now less than $100... So you might think about
>> replacing
>> >> >> you 20 GB drives by something bigger.
>> >> >> to reply to your last question, for your data here, you will need AT
>> >> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go
>> under
>> >> >> 500GB.
>> >> >>
>> >> >> IMHO
>> >> >>
>> >> >> JM
>> >> >>
>> >> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> >> >> > Hello,
>> >> >> >
>> >> >> > I was estimating how much disk space do I need for my cluster.
>> >> >> >
>> >> >> > I have 24 million JSON documents approx. 5kb each
>> >> >> > the Json is to be stored into HBASE with some identifying data in
>> >> >> coloumns
>> >> >> > and I also want to store the Json for later retrieval based on the
>> >> >> > Id
>> >> >> data
>> >> >> > as keys in Hbase.
>> >> >> > I have my HDFS replication set to 3
>> >> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>> >> >> > approx
>> >> >> > 11
>> >> >> GB
>> >> >> > is available for use on my 20 GB node.
>> >> >> >
>> >> >> > I have no idea, if I have not enabled Hbase replication, is the
>> HDFS
>> >> >> > replication enough to keep the data safe and redundant.
>> >> >> > How much total disk space I will need for the storage of the data.
>> >> >> >
>> >> >> > Please help me estimate this.
>> >> >> >
>> >> >> > Thank you so much.
>> >> >> >
>> >> >> > --
>> >> >> > Regards,
>> >> >> > Ouch Whisper
>> >> >> > 010101010101
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Ouch Whisper
>> >> > 010101010101
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>> >
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
Thank you for the reply.

It would be great if someone could suggest whether it is better to set up my
cluster on Rackspace or on Amazon using EC2 servers,
keeping in mind that Amazon services have been having a lot of downtime...
My main points of concern are performance and availability.
My cluster has to be highly available.

Thanks.


On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> It all depend what you want to do with this data and the power of each
> single node. There is no one size fit all rule.
>
> The more nodes you have, the more CPU power you will have to process
> the data... But if you 80GB boxes CPUs are faster than your 40GB boxes
> CPU ,maybe you should take the 80GB then.
>
> If you want to get better advices from the list, you will need to
> beter define you needs and the nodes you can have.
>
> JM
>
> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> > If we look at it with performance in mind,
> > is it better to have 20 Nodes with 40 GB HDD
> > or is it better to have 10 Nodes with 80 GB HDD?
> >
> > they are connected on a gigabit LAN
> >
> > Thnx
> >
> >
> > On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> 20 nodes with 40 GB will do the work.
> >>
> >> After that you will have to consider performances based on your access
> >> pattern. But that's another story.
> >>
> >> JM
> >>
> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> >> > Thank you for the replies,
> >> >
> >> > So I take it that I should have atleast 800 GB on total free space on
> >> > HDFS.. (combined free space of all the nodes connected to the
> cluster).
> >> So
> >> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
> >> Will
> >> > this be enough for the storage?
> >> > Please confirm.
> >> >
> >> > Thanking You,
> >> > Regards,
> >> > Panshul.
> >> >
> >> >
> >> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
> >> > jean-marc@spaggiari.org> wrote:
> >> >
> >> >> Hi Panshul,
> >> >>
> >> >> If you have 20 GB with a replication factor set to 3, you have only
> >> >> 6.6GB available, not 11GB. You need to divide the total space by the
> >> >> replication factor.
> >> >>
> >> >> Also, if you store your JSon into HBase, you need to add the key size
> >> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
> >> >>
> >> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
> to
> >> >> store it. Without including the key size. Even with a replication
> >> >> factor set to 5 you don't have the space.
> >> >>
> >> >> Now, you can add some compression, but even with a lucky factor of
> 50%
> >> >> you still don't have the space. You will need something like 90%
> >> >> compression factor to be able to store this data in your cluster.
> >> >>
> >> >> A 1T drive is now less than $100... So you might think about
> replacing
> >> >> you 20 GB drives by something bigger.
> >> >> to reply to your last question, for your data here, you will need AT
> >> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go
> under
> >> >> 500GB.
> >> >>
> >> >> IMHO
> >> >>
> >> >> JM
> >> >>
> >> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> >> >> > Hello,
> >> >> >
> >> >> > I was estimating how much disk space do I need for my cluster.
> >> >> >
> >> >> > I have 24 million JSON documents approx. 5kb each
> >> >> > the Json is to be stored into HBASE with some identifying data in
> >> >> coloumns
> >> >> > and I also want to store the Json for later retrieval based on the
> >> >> > Id
> >> >> data
> >> >> > as keys in Hbase.
> >> >> > I have my HDFS replication set to 3
> >> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
> >> >> > approx
> >> >> > 11
> >> >> GB
> >> >> > is available for use on my 20 GB node.
> >> >> >
> >> >> > I have no idea, if I have not enabled Hbase replication, is the
> HDFS
> >> >> > replication enough to keep the data safe and redundant.
> >> >> > How much total disk space I will need for the storage of the data.
> >> >> >
> >> >> > Please help me estimate this.
> >> >> >
> >> >> > Thank you so much.
> >> >> >
> >> >> > --
> >> >> > Regards,
> >> >> > Ouch Whisper
> >> >> > 010101010101
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Regards,
> >> > Ouch Whisper
> >> > 010101010101
> >> >
> >>
> >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>



-- 
Regards,
Ouch Whisper
010101010101

Re: Estimating disk space requirements

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
It all depends on what you want to do with this data and the power of each
single node. There is no one-size-fits-all rule.

The more nodes you have, the more CPU power you will have to process
the data... But if your 80 GB boxes' CPUs are faster than your 40 GB boxes'
CPUs, maybe you should take the 80 GB boxes instead.

If you want to get better advice from the list, you will need to
better define your needs and the nodes you can have.

JM

2013/1/18, Panshul Whisper <ou...@gmail.com>:
> If we look at it with performance in mind,
> is it better to have 20 Nodes with 40 GB HDD
> or is it better to have 10 Nodes with 80 GB HDD?
>
> they are connected on a gigabit LAN
>
> Thnx
>
>
> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> 20 nodes with 40 GB will do the work.
>>
>> After that you will have to consider performances based on your access
>> pattern. But that's another story.
>>
>> JM
>>
>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> > Thank you for the replies,
>> >
>> > So I take it that I should have atleast 800 GB on total free space on
>> > HDFS.. (combined free space of all the nodes connected to the cluster).
>> So
>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>> Will
>> > this be enough for the storage?
>> > Please confirm.
>> >
>> > Thanking You,
>> > Regards,
>> > Panshul.
>> >
>> >
>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> >> Hi Panshul,
>> >>
>> >> If you have 20 GB with a replication factor set to 3, you have only
>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>> >> replication factor.
>> >>
>> >> Also, if you store your JSon into HBase, you need to add the key size
>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>> >>
>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>> >> store it. Without including the key size. Even with a replication
>> >> factor set to 5 you don't have the space.
>> >>
>> >> Now, you can add some compression, but even with a lucky factor of 50%
>> >> you still don't have the space. You will need something like 90%
>> >> compression factor to be able to store this data in your cluster.
>> >>
>> >> A 1T drive is now less than $100... So you might think about replacing
>> >> you 20 GB drives by something bigger.
>> >> to reply to your last question, for your data here, you will need AT
>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>> >> 500GB.
>> >>
>> >> IMHO
>> >>
>> >> JM
>> >>
>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> >> > Hello,
>> >> >
>> >> > I was estimating how much disk space do I need for my cluster.
>> >> >
>> >> > I have 24 million JSON documents approx. 5kb each
>> >> > the Json is to be stored into HBASE with some identifying data in
>> >> coloumns
>> >> > and I also want to store the Json for later retrieval based on the
>> >> > Id
>> >> data
>> >> > as keys in Hbase.
>> >> > I have my HDFS replication set to 3
>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>> >> > approx
>> >> > 11
>> >> GB
>> >> > is available for use on my 20 GB node.
>> >> >
>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>> >> > replication enough to keep the data safe and redundant.
>> >> > How much total disk space I will need for the storage of the data.
>> >> >
>> >> > Please help me estimate this.
>> >> >
>> >> > Thank you so much.
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Ouch Whisper
>> >> > 010101010101
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>> >
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: Estimating disk space requirements

Posted by Mohammad Tariq <do...@gmail.com>.
You can attach a separate disk to your instance (for example an
EBS volume in the case of AWS) and store only Hadoop-related data
on it, keeping one disk for OS-related things.
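
A minimal sketch of what that could look like in hdfs-site.xml, assuming the
extra disk is mounted at /data (the mount point and directory names here are
only placeholders, not a recommendation):

    <property>
      <name>dfs.data.dir</name>  <!-- dfs.datanode.data.dir on newer releases -->
      <value>/data/hadoop/dfs/data</value>
    </property>

Mount the EBS volume at that path before starting the DataNode, and keep the
OS on the root disk.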

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Sat, Jan 19, 2013 at 4:00 AM, Panshul Whisper <ou...@gmail.com> wrote:

> Thnx for the reply Ted,
>
> You can find 40 GB disks when u make virtual nodes on a cloud like
> Rackspace ;-)
>
> About the os partitions I did not exactly understand what you meant.
> I have made a server on the cloud.. And I just installed and configured
> hadoop and hbase in the /use/local folder.
> And I am pretty sure it does not have a separate partition for root.
>
> Please help me explain what u meant and what else precautions should I
> take.
>
> Thanks,
>
> Regards,
> Ouch Whisper
> 01010101010
> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>> Where do you find 40gb disks now a days?
>>
>> Normally your performance is going to be better with more space but your
>> network may be your limiting factor for some computations.  That could give
>> you some paradoxical scaling.  Hbase will rarely show this behavior.
>>
>> Keep in mind you also want to allow for an os partition. Current standard
>> practice is to reserve as much as 100 GB for that partition but in your
>> case 10gb better:-)
>>
>> Note that if you account for this, the node counts don't scale as simply.
>>  The overhead of these os partitions goes up with number of nodes.
>>
>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>> wrote:
>>
>> If we look at it with performance in mind,
>> is it better to have 20 Nodes with 40 GB HDD
>> or is it better to have 10 Nodes with 80 GB HDD?
>>
>> they are connected on a gigabit LAN
>>
>> Thnx
>>
>>
>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org> wrote:
>>
>>> 20 nodes with 40 GB will do the work.
>>>
>>> After that you will have to consider performances based on your access
>>> pattern. But that's another story.
>>>
>>> JM
>>>
>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> > Thank you for the replies,
>>> >
>>> > So I take it that I should have atleast 800 GB on total free space on
>>> > HDFS.. (combined free space of all the nodes connected to the
>>> cluster). So
>>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>>> Will
>>> > this be enough for the storage?
>>> > Please confirm.
>>> >
>>> > Thanking You,
>>> > Regards,
>>> > Panshul.
>>> >
>>> >
>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>> > jean-marc@spaggiari.org> wrote:
>>> >
>>> >> Hi Panshul,
>>> >>
>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>> >> replication factor.
>>> >>
>>> >> Also, if you store your JSon into HBase, you need to add the key size
>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>> >>
>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>>> >> store it. Without including the key size. Even with a replication
>>> >> factor set to 5 you don't have the space.
>>> >>
>>> >> Now, you can add some compression, but even with a lucky factor of 50%
>>> >> you still don't have the space. You will need something like 90%
>>> >> compression factor to be able to store this data in your cluster.
>>> >>
>>> >> A 1T drive is now less than $100... So you might think about replacing
>>> >> you 20 GB drives by something bigger.
>>> >> to reply to your last question, for your data here, you will need AT
>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>>> >> 500GB.
>>> >>
>>> >> IMHO
>>> >>
>>> >> JM
>>> >>
>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> >> > Hello,
>>> >> >
>>> >> > I was estimating how much disk space do I need for my cluster.
>>> >> >
>>> >> > I have 24 million JSON documents approx. 5kb each
>>> >> > the Json is to be stored into HBASE with some identifying data in
>>> >> coloumns
>>> >> > and I also want to store the Json for later retrieval based on the
>>> Id
>>> >> data
>>> >> > as keys in Hbase.
>>> >> > I have my HDFS replication set to 3
>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>> approx
>>> >> > 11
>>> >> GB
>>> >> > is available for use on my 20 GB node.
>>> >> >
>>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>>> >> > replication enough to keep the data safe and redundant.
>>> >> > How much total disk space I will need for the storage of the data.
>>> >> >
>>> >> > Please help me estimate this.
>>> >> >
>>> >> > Thank you so much.
>>> >> >
>>> >> > --
>>> >> > Regards,
>>> >> > Ouch Whisper
>>> >> > 010101010101
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>> >
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>>
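
A quick back-of-the-envelope check of the sizing discussed in this thread, as a
small Python sketch. The inputs are assumptions for illustration only: ~5 KB per
JSON document, a ~100-byte allowance for the HBase row key and metadata,
dfs.replication = 3, and ~25% headroom for intermediate/temporary data.

docs = 24 * 1000 * 1000        # 24 million JSON documents
doc_bytes = 5 * 1024           # ~5 KB per document
key_overhead = 100             # assumed per-row key + metadata allowance (bytes)
replication = 3                # HDFS replication factor
temp_headroom = 0.25           # assumed headroom for intermediate data

gb = 1024.0 ** 3
raw = docs * (doc_bytes + key_overhead)
replicated = raw * replication
total = replicated * (1 + temp_headroom)

print("raw data:         %.0f GB" % (raw / gb))         # ~117 GB
print("with replication: %.0f GB" % (replicated / gb))  # ~350 GB
print("with headroom:    %.0f GB" % (total / gb))       # ~438 GB

With those assumptions the cluster needs roughly 350 GB of raw HDFS capacity just
to hold the replicated data, and closer to 450-500 GB once working space is
included, which is in line with the 350 GB minimum / 500 GB recommendation quoted
above.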

Re: Estimating disk space requirements

Posted by Ted Dunning <td...@maprtech.com>.
Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <jb...@vmware.com> wrote:

> I disagree.  There are some significant advantages to using "many small
> nodes" instead of "few big nodes".  As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs.  But consider:
>
> - NUMA:  If your hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses.  The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine.  This effect will be more severe on bigger
> machines.  Hadoop nodes that fit on a NUMA node will not access remote
> memory at all (at least on vSphere).
>

This is definitely a good point with respect to untainted Hadoop, but with
a system like MapR, there is a significant amount of core locality that
goes on to minimize NUMA-remote fetches.  This can have significant impact,
of course.

> - Disk partitioning:  Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it.  Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.
>

Again, this is an implementation side-effect.  Good I/O scheduling and
proper striping can mitigate this substantially.

Going the other way, splitting disks between different VMs can be
disastrous.


>  Partitioning disks among nodes decreases this effect.  The extreme case
> is one disk per node, where no disks will be idle as long as there is work
> to do.
>

Yes.  Even deficient implementations should succeed in this case.

You do lose the ability to allow big-memory jobs that would otherwise span
multiple slots.


> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.
>

Definitely true.
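
On the node-count question raised earlier in the thread (20 x 40 GB versus
10 x 80 GB), the OS-partition overhead can be folded into a tiny capacity model.
The 10 GB per-node OS reserve and replication factor 3 below are assumptions for
illustration, not recommendations:

def usable_hdfs_gb(nodes, disk_gb, os_reserve_gb=10, replication=3):
    # Rough post-replication HDFS capacity for a homogeneous cluster.
    # The per-node OS reserve is paid once per node, so it costs the
    # many-small-nodes layout more than the few-big-nodes layout.
    per_node = max(disk_gb - os_reserve_gb, 0)
    return nodes * per_node / float(replication)

print(usable_hdfs_gb(20, 40))   # 20 nodes x 40 GB -> ~200 GB of data capacity
print(usable_hdfs_gb(10, 80))   # 10 nodes x 80 GB -> ~233 GB of data capacity

Both layouts start from the same 800 GB of raw disk, but the smaller-node layout
loses twice as much of it to OS partitions; with the ~117 GB of data estimated
above (plus headroom), either still fits, which matches the "20 nodes with 40 GB
will do the work" answer earlier in the thread.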

Re: Estimating disk space requirements

Posted by Jeffrey Buell <jb...@vmware.com>.
I disagree. There are some significant advantages to using "many small nodes" instead of "few big nodes". As Ted points out, there are some disadvantages as well, so you have to look at the trade-offs. But consider: 

- NUMA: If your hadoop nodes span physical NUMA nodes, then performance will suffer from remote memory accesses. The Linux scheduler tries to minimize this, but I've found that about 1/3 of memory accesses are remote on a 2-socket machine. This effect will be more severe on bigger machines. Hadoop nodes that fit on a NUMA node will not access remote memory at all (at least on vSphere).

- Disk partitioning: Smaller nodes with fewer disks each can significantly increase average disk utilization, not decrease it. Having many threads operating against many disks in the "big node" case tends to leave some disks idle while others are over-subscribed. Partitioning disks among nodes decreases this effect. The extreme case is one disk per node, where no disks will be idle as long as there is work to do. 

- Management: Not a performance effect, but smaller nodes enable easier multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware with other workloads, etc. 

Jeff 
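
Jeff's disk-utilization point can be illustrated with a toy model. It assumes each
concurrent task reads from a uniformly random disk, which is a deliberate
oversimplification of real HDFS/HBase I/O, but it shows why a big node with many
disks tends to leave some of them idle:

def expected_idle_fraction(disks, tasks):
    # Probability a given disk gets no work when each of `tasks`
    # concurrent tasks independently picks one of `disks` at random.
    return (1.0 - 1.0 / disks) ** tasks

print(expected_idle_fraction(12, 12))   # one big node, 12 disks, 12 tasks -> ~0.35 idle
print(expected_idle_fraction(1, 1))     # one disk per node, 1 task per node -> 0.0 idle

Under this model roughly a third of the big node's disks sit idle at any moment,
while the one-disk-per-node extreme keeps every disk busy whenever its node has
work.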

----- Original Message -----

From: "Ted Dunning" <td...@maprtech.com> 
To: user@hadoop.apache.org 
Sent: Friday, January 18, 2013 3:36:30 PM 
Subject: Re: Estimating disk space requirements 

If you make 20 individual small servers, that isn't much different from 20 from one server. The only difference would be if the neighbors of the separate VMs use less resource. 


On Fri, Jan 18, 2013 at 3:34 PM, Panshul Whisper < ouchwhisper@gmail.com > wrote: 



ah now i understand what you mean. 
I will be creating 20 individual servers on the cloud, and not create one big server and make several virtual nodes inside it. 
I will be paying for 20 different nodes.. all configured with hadoop and connected to the cluster. 


Thanx for the intel :) 



On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning < tdunning@maprtech.com > wrote: 

It is usually better to not subdivide nodes into virtual nodes. You will generally get better performance from the original node because you only pay for the OS once and because your disk I/O will be scheduled better.


If you look at EC2 pricing, however, the spot market often has arbitrage opportunities where one size node is absurdly cheap relative to others. In that case, it pays to scale the individual nodes up or down. 


The only reasonable reason to split nodes to very small levels is for testing and training. 




On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper < ouchwhisper@gmail.com > wrote: 


Thnx for the reply Ted, 
You can find 40 GB disks when u make virtual nodes on a cloud like Rackspace ;-) 
About the os partitions I did not exactly understand what you meant. 
I have made a server on the cloud.. And I just installed and configured hadoop and hbase in the /usr/local folder.
And I am pretty sure it does not have a separate partition for root. 
Please help me explain what u meant and what else precautions should I take. 
Thanks, 
Regards, 
Ouch Whisper 
01010101010 


On Jan 18, 2013 11:11 PM, "Ted Dunning" < tdunning@maprtech.com > wrote: 



Where do you find 40gb disks now a days? 


Normally your performance is going to be better with more space but your network may be your limiting factor for some computations. That could give you some paradoxical scaling. Hbase will rarely show this behavior. 


Keep in mind you also want to allow for an os partition. Current standard practice is to reserve as much as 100 GB for that partition but in your case 10gb better:-) 


Note that if you account for this, the node counts don't scale as simply. The overhead of these os partitions goes up with number of nodes. 

On Jan 18, 2013, at 8:55 AM, Panshul Whisper < ouchwhisper@gmail.com > wrote: 





If we look at it with performance in mind, 
is it better to have 20 Nodes with 40 GB HDD 
or is it better to have 10 Nodes with 80 GB HDD? 


they are connected on a gigabit LAN 


Thnx 



On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari < jean-marc@spaggiari.org > wrote: 

20 nodes with 40 GB will do the work. 

After that you will have to consider performances based on your access 
pattern. But that's another story. 



JM 

2013/1/18, Panshul Whisper < ouchwhisper@gmail.com >: 
> Thank you for the replies, 
> 
> So I take it that I should have atleast 800 GB on total free space on 
> HDFS.. (combined free space of all the nodes connected to the cluster). So 
> I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will 
> this be enough for the storage? 
> Please confirm. 
> 
> Thanking You, 
> Regards, 
> Panshul. 
> 
> 
> On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari < 
> jean-marc@spaggiari.org > wrote: 
> 
>> Hi Panshul, 
>> 
>> If you have 20 GB with a replication factor set to 3, you have only 
>> 6.6GB available, not 11GB. You need to divide the total space by the 
>> replication factor. 
>> 
>> Also, if you store your JSon into HBase, you need to add the key size 
>> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference. 
>> 
>> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to 
>> store it. Without including the key size. Even with a replication 
>> factor set to 5 you don't have the space. 
>> 
>> Now, you can add some compression, but even with a lucky factor of 50% 
>> you still don't have the space. You will need something like 90% 
>> compression factor to be able to store this data in your cluster. 
>> 
>> A 1T drive is now less than $100... So you might think about replacing 
>> you 20 GB drives by something bigger. 
>> to reply to your last question, for your data here, you will need AT 
>> LEAST 350GB overall storage. But that's a bare minimum. Don't go under 
>> 500GB. 
>> 
>> IMHO 
>> 
>> JM 
>> 
>> 2013/1/18, Panshul Whisper < ouchwhisper@gmail.com >: 
>> > Hello, 
>> > 
>> > I was estimating how much disk space do I need for my cluster. 
>> > 
>> > I have 24 million JSON documents approx. 5kb each 
>> > the Json is to be stored into HBASE with some identifying data in 
>> coloumns 
>> > and I also want to store the Json for later retrieval based on the Id 
>> data 
>> > as keys in Hbase. 
>> > I have my HDFS replication set to 3 
>> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx 
>> > 11 
>> GB 
>> > is available for use on my 20 GB node. 
>> > 
>> > I have no idea, if I have not enabled Hbase replication, is the HDFS 
>> > replication enough to keep the data safe and redundant. 
>> > How much total disk space I will need for the storage of the data. 
>> > 
>> > Please help me estimate this. 
>> > 
>> > Thank you so much. 
>> > 
>> > -- 
>> > Regards, 
>> > Ouch Whisper 
>> > 010101010101 
>> > 
>> 
> 
> 
> 
> -- 
> Regards, 
> Ouch Whisper 
> 010101010101 
> 






-- 


Regards, Ouch Whisper 
010101010101 




-- 


Regards, Ouch Whisper 
010101010101 



Re: Estimating disk space requirements

Posted by Ted Dunning <td...@maprtech.com>.
If you make 20 individual small servers, that isn't much different from 20
VMs carved out of one server.  The only difference would be if the neighbors
of the separate VMs use less resource.

On Fri, Jan 18, 2013 at 3:34 PM, Panshul Whisper <ou...@gmail.com>wrote:

> ah now i understand what you mean.
> I will be creating 20 individual servers on the cloud, and not create one
> big server and make several virtual nodes inside it.
> I will be paying for 20 different nodes.. all configured with hadoop and
> connected to the cluster.
>
> Thanx for the intel :)
>
>
> On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning <td...@maprtech.com>wrote:
>
>> It is usually better to not subdivide nodes into virtual nodes.  You will
>> generally get better performance from the original node because you only
>> pay for the OS once and because your disk I/O will be scheduled better.
>>
>> If you look at EC2 pricing, however, the spot market often has arbitrage
>> opportunities where one size node is absurdly cheap relative to others.  In
>> that case, it pays to scale the individual nodes up or down.
>>
>> The only reasonable reason to split nodes to very small levels is for
>> testing and training.
>>
>>
>> On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper <ou...@gmail.com>wrote:
>>
>>> Thnx for the reply Ted,
>>>
>>> You can find 40 GB disks when u make virtual nodes on a cloud like
>>> Rackspace ;-)
>>>
>>> About the os partitions I did not exactly understand what you meant.
>>> I have made a server on the cloud.. And I just installed and configured
>>> hadoop and hbase in the /usr/local folder.
>>> And I am pretty sure it does not have a separate partition for root.
>>>
>>> Please help me explain what u meant and what else precautions should I
>>> take.
>>>
>>> Thanks,
>>>
>>> Regards,
>>> Ouch Whisper
>>> 01010101010
>>> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>>>
>>>> Where do you find 40gb disks now a days?
>>>>
>>>> Normally your performance is going to be better with more space but
>>>> your network may be your limiting factor for some computations.  That could
>>>> give you some paradoxical scaling.  Hbase will rarely show this behavior.
>>>>
>>>> Keep in mind you also want to allow for an os partition. Current
>>>> standard practice is to reserve as much as 100 GB for that partition but in
>>>> your case 10gb better:-)
>>>>
>>>> Note that if you account for this, the node counts don't scale as
>>>> simply.  The overhead of these os partitions goes up with number of nodes.
>>>>
>>>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>>>> wrote:
>>>>
>>>> If we look at it with performance in mind,
>>>> is it better to have 20 Nodes with 40 GB HDD
>>>> or is it better to have 10 Nodes with 80 GB HDD?
>>>>
>>>> they are connected on a gigabit LAN
>>>>
>>>> Thnx
>>>>
>>>>
>>>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>>>> jean-marc@spaggiari.org> wrote:
>>>>
>>>>> 20 nodes with 40 GB will do the work.
>>>>>
>>>>> After that you will have to consider performances based on your access
>>>>> pattern. But that's another story.
>>>>>
>>>>> JM
>>>>>
>>>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>>> > Thank you for the replies,
>>>>> >
>>>>> > So I take it that I should have atleast 800 GB on total free space on
>>>>> > HDFS.. (combined free space of all the nodes connected to the
>>>>> cluster). So
>>>>> > I can connect 20 nodes having 40 GB of hdd on each node to my
>>>>> cluster. Will
>>>>> > this be enough for the storage?
>>>>> > Please confirm.
>>>>> >
>>>>> > Thanking You,
>>>>> > Regards,
>>>>> > Panshul.
>>>>> >
>>>>> >
>>>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>>>> > jean-marc@spaggiari.org> wrote:
>>>>> >
>>>>> >> Hi Panshul,
>>>>> >>
>>>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>>>> >> replication factor.
>>>>> >>
>>>>> >> Also, if you store your JSon into HBase, you need to add the key
>>>>> size
>>>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>>>> >>
>>>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
>>>>> to
>>>>> >> store it. Without including the key size. Even with a replication
>>>>> >> factor set to 5 you don't have the space.
>>>>> >>
>>>>> >> Now, you can add some compression, but even with a lucky factor of
>>>>> 50%
>>>>> >> you still don't have the space. You will need something like 90%
>>>>> >> compression factor to be able to store this data in your cluster.
>>>>> >>
>>>>> >> A 1T drive is now less than $100... So you might think about
>>>>> replacing
>>>>> >> you 20 GB drives by something bigger.
>>>>> >> to reply to your last question, for your data here, you will need AT
>>>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go
>>>>> under
>>>>> >> 500GB.
>>>>> >>
>>>>> >> IMHO
>>>>> >>
>>>>> >> JM
>>>>> >>
>>>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>>> >> > Hello,
>>>>> >> >
>>>>> >> > I was estimating how much disk space do I need for my cluster.
>>>>> >> >
>>>>> >> > I have 24 million JSON documents approx. 5kb each
>>>>> >> > the Json is to be stored into HBASE with some identifying data in
>>>>> >> coloumns
>>>>> >> > and I also want to store the Json for later retrieval based on
>>>>> the Id
>>>>> >> data
>>>>> >> > as keys in Hbase.
>>>>> >> > I have my HDFS replication set to 3
>>>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>>>> approx
>>>>> >> > 11
>>>>> >> GB
>>>>> >> > is available for use on my 20 GB node.
>>>>> >> >
>>>>> >> > I have no idea, if I have not enabled Hbase replication, is the
>>>>> HDFS
>>>>> >> > replication enough to keep the data safe and redundant.
>>>>> >> > How much total disk space I will need for the storage of the data.
>>>>> >> >
>>>>> >> > Please help me estimate this.
>>>>> >> >
>>>>> >> > Thank you so much.
>>>>> >> >
>>>>> >> > --
>>>>> >> > Regards,
>>>>> >> > Ouch Whisper
>>>>> >> > 010101010101
>>>>> >> >
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Ouch Whisper
>>>>> > 010101010101
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Ouch Whisper
>>>> 010101010101
>>>>
>>>>
>>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>
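
If you do go shopping the spot market for the arbitrage mentioned above, the
number to compare across instance sizes is cost per usable gigabyte rather than
cost per node. The sketch below is illustrative only; the prices, the 10 GB OS
reserve, and replication factor 3 are hypothetical placeholders, not quotes from
any provider:

def price_per_usable_gb_hour(price_per_hour, disk_gb, os_reserve_gb=10, replication=3):
    # Hourly cost of one GB of post-replication HDFS capacity for a node size.
    usable_gb = (disk_gb - os_reserve_gb) / float(replication)
    return price_per_hour / usable_gb

print(price_per_usable_gb_hour(0.06, 40))   # hypothetical small node
print(price_per_usable_gb_hour(0.10, 80))   # hypothetical bigger node

Whichever size comes out cheaper per usable GB at the moment is the one worth
scaling toward, provided each node remains a real machine (or at least a whole
VM) rather than being subdivided further.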

Re: Estimating disk space requirements

Posted by Ted Dunning <td...@maprtech.com>.
If you make 20 individual small servers, that isn't much different from
carving 20 virtual nodes out of one server.  The only difference would be
if the neighbors of the separate VMs use less resource.
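
For what it's worth, the arithmetic from earlier in the thread written out
as a small Python sketch (the 1 KB per-row key/metadata overhead and the
25% working-space headroom are illustrative assumptions, not measured
numbers):

    # Rough HDFS/HBase storage estimate for the numbers discussed in this thread.
    docs = 24_000_000          # number of JSON documents
    doc_size = 5 * 1024        # ~5 KB per document, in bytes
    key_size = 1024            # assumed HBase row key + metadata overhead per row
    replication = 3            # HDFS replication factor

    raw = docs * (doc_size + key_size)   # data as written once
    replicated = raw * replication       # what HDFS actually stores on disk
    working = replicated * 1.25          # ~25% headroom for temp/intermediate data

    nodes = 20
    per_node = working / nodes
    print(f"raw: {raw / 2**30:.0f} GiB, replicated x{replication}: {replicated / 2**30:.0f} GiB")
    print(f"with headroom: {working / 2**30:.0f} GiB total, ~{per_node / 2**30:.1f} GiB per node on {nodes} nodes")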

On Fri, Jan 18, 2013 at 3:34 PM, Panshul Whisper <ou...@gmail.com>wrote:

> ah now i understand what you mean.
> I will be creating 20 individual servers on the cloud, and not create one
> big server and make several virtual nodes inside it.
> I will be paying for 20 different nodes.. all configured with hadoop and
> connected to the cluster.
>
> Thanx for the intel :)
>
>
> On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning <td...@maprtech.com>wrote:
>
>> It is usually better to not subdivide nodes into virtual nodes.  You will
>> generally get better performance form the original node because you only
>> pay for the OS once and because your disk I/O will be scheduled better.
>>
>> If you look at EC2 pricing, however, the spot market often has arbitrage
>> opportunities where one size node is absurdly cheap relative to others.  In
>> that case, it pays to scale the individual nodes up or down.
>>
>> The only reasonable reason to split nodes to very small levels is for
>> testing and training.
>>
>>
>> On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper <ou...@gmail.com>wrote:
>>
>>> Thnx for the reply Ted,
>>>
>>> You can find 40 GB disks when u make virtual nodes on a cloud like
>>> Rackspace ;-)
>>>
>>> About the os partitions I did not exactly understand what you meant.
>>> I have made a server on the cloud.. And I just installed and configured
>>> hadoop and hbase in the /use/local folder.
>>> And I am pretty sure it does not have a separate partition for root.
>>>
>>> Please help me explain what u meant and what else precautions should I
>>> take.
>>>
>>> Thanks,
>>>
>>> Regards,
>>> Ouch Whisper
>>> 01010101010
>>> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>>>
>>>> Where do you find 40gb disks now a days?
>>>>
>>>> Normally your performance is going to be better with more space but
>>>> your network may be your limiting factor for some computations.  That could
>>>> give you some paradoxical scaling.  Hbase will rarely show this behavior.
>>>>
>>>> Keep in mind you also want to allow for an os partition. Current
>>>> standard practice is to reserve as much as 100 GB for that partition but in
>>>> your case 10gb better:-)
>>>>
>>>> Note that if you account for this, the node counts don't scale as
>>>> simply.  The overhead of these os partitions goes up with number of nodes.
>>>>
>>>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>>>> wrote:
>>>>
>>>> If we look at it with performance in mind,
>>>> is it better to have 20 Nodes with 40 GB HDD
>>>> or is it better to have 10 Nodes with 80 GB HDD?
>>>>
>>>> they are connected on a gigabit LAN
>>>>
>>>> Thnx
>>>>
>>>>
>>>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>>>> jean-marc@spaggiari.org> wrote:
>>>>
>>>>> 20 nodes with 40 GB will do the work.
>>>>>
>>>>> After that you will have to consider performances based on your access
>>>>> pattern. But that's another story.
>>>>>
>>>>> JM
>>>>>
>>>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>>> > Thank you for the replies,
>>>>> >
>>>>> > So I take it that I should have atleast 800 GB on total free space on
>>>>> > HDFS.. (combined free space of all the nodes connected to the
>>>>> cluster). So
>>>>> > I can connect 20 nodes having 40 GB of hdd on each node to my
>>>>> cluster. Will
>>>>> > this be enough for the storage?
>>>>> > Please confirm.
>>>>> >
>>>>> > Thanking You,
>>>>> > Regards,
>>>>> > Panshul.
>>>>> >
>>>>> >
>>>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>>>> > jean-marc@spaggiari.org> wrote:
>>>>> >
>>>>> >> Hi Panshul,
>>>>> >>
>>>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>>>> >> replication factor.
>>>>> >>
>>>>> >> Also, if you store your JSon into HBase, you need to add the key
>>>>> size
>>>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>>>> >>
>>>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
>>>>> to
>>>>> >> store it. Without including the key size. Even with a replication
>>>>> >> factor set to 5 you don't have the space.
>>>>> >>
>>>>> >> Now, you can add some compression, but even with a lucky factor of
>>>>> 50%
>>>>> >> you still don't have the space. You will need something like 90%
>>>>> >> compression factor to be able to store this data in your cluster.
>>>>> >>
>>>>> >> A 1T drive is now less than $100... So you might think about
>>>>> replacing
>>>>> >> you 20 GB drives by something bigger.
>>>>> >> to reply to your last question, for your data here, you will need AT
>>>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go
>>>>> under
>>>>> >> 500GB.
>>>>> >>
>>>>> >> IMHO
>>>>> >>
>>>>> >> JM
>>>>> >>
>>>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>>> >> > Hello,
>>>>> >> >
>>>>> >> > I was estimating how much disk space do I need for my cluster.
>>>>> >> >
>>>>> >> > I have 24 million JSON documents approx. 5kb each
>>>>> >> > the Json is to be stored into HBASE with some identifying data in
>>>>> >> coloumns
>>>>> >> > and I also want to store the Json for later retrieval based on
>>>>> the Id
>>>>> >> data
>>>>> >> > as keys in Hbase.
>>>>> >> > I have my HDFS replication set to 3
>>>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>>>> approx
>>>>> >> > 11
>>>>> >> GB
>>>>> >> > is available for use on my 20 GB node.
>>>>> >> >
>>>>> >> > I have no idea, if I have not enabled Hbase replication, is the
>>>>> HDFS
>>>>> >> > replication enough to keep the data safe and redundant.
>>>>> >> > How much total disk space I will need for the storage of the data.
>>>>> >> >
>>>>> >> > Please help me estimate this.
>>>>> >> >
>>>>> >> > Thank you so much.
>>>>> >> >
>>>>> >> > --
>>>>> >> > Regards,
>>>>> >> > Ouch Whisper
>>>>> >> > 010101010101
>>>>> >> >
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Ouch Whisper
>>>>> > 010101010101
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Ouch Whisper
>>>> 010101010101
>>>>
>>>>
>>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
Ah, now I understand what you mean.
I will be creating 20 individual servers on the cloud, rather than creating one
big server with several virtual nodes inside it.
I will be paying for 20 different nodes, all configured with Hadoop and
connected to the cluster.

Thanks for the intel :)


On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning <td...@maprtech.com> wrote:

> It is usually better to not subdivide nodes into virtual nodes.  You will
> generally get better performance form the original node because you only
> pay for the OS once and because your disk I/O will be scheduled better.
>
> If you look at EC2 pricing, however, the spot market often has arbitrage
> opportunities where one size node is absurdly cheap relative to others.  In
> that case, it pays to scale the individual nodes up or down.
>
> The only reasonable reason to split nodes to very small levels is for
> testing and training.
>
>
> On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper <ou...@gmail.com>wrote:
>
>> Thnx for the reply Ted,
>>
>> You can find 40 GB disks when u make virtual nodes on a cloud like
>> Rackspace ;-)
>>
>> About the os partitions I did not exactly understand what you meant.
>> I have made a server on the cloud.. And I just installed and configured
>> hadoop and hbase in the /use/local folder.
>> And I am pretty sure it does not have a separate partition for root.
>>
>> Please help me explain what u meant and what else precautions should I
>> take.
>>
>> Thanks,
>>
>> Regards,
>> Ouch Whisper
>> 01010101010
>> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>>
>>> Where do you find 40gb disks now a days?
>>>
>>> Normally your performance is going to be better with more space but your
>>> network may be your limiting factor for some computations.  That could give
>>> you some paradoxical scaling.  Hbase will rarely show this behavior.
>>>
>>> Keep in mind you also want to allow for an os partition. Current
>>> standard practice is to reserve as much as 100 GB for that partition but in
>>> your case 10gb better:-)
>>>
>>> Note that if you account for this, the node counts don't scale as
>>> simply.  The overhead of these os partitions goes up with number of nodes.
>>>
>>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>>> wrote:
>>>
>>> If we look at it with performance in mind,
>>> is it better to have 20 Nodes with 40 GB HDD
>>> or is it better to have 10 Nodes with 80 GB HDD?
>>>
>>> they are connected on a gigabit LAN
>>>
>>> Thnx
>>>
>>>
>>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>>> jean-marc@spaggiari.org> wrote:
>>>
>>>> 20 nodes with 40 GB will do the work.
>>>>
>>>> After that you will have to consider performances based on your access
>>>> pattern. But that's another story.
>>>>
>>>> JM
>>>>
>>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>> > Thank you for the replies,
>>>> >
>>>> > So I take it that I should have atleast 800 GB on total free space on
>>>> > HDFS.. (combined free space of all the nodes connected to the
>>>> cluster). So
>>>> > I can connect 20 nodes having 40 GB of hdd on each node to my
>>>> cluster. Will
>>>> > this be enough for the storage?
>>>> > Please confirm.
>>>> >
>>>> > Thanking You,
>>>> > Regards,
>>>> > Panshul.
>>>> >
>>>> >
>>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>>> > jean-marc@spaggiari.org> wrote:
>>>> >
>>>> >> Hi Panshul,
>>>> >>
>>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>>> >> replication factor.
>>>> >>
>>>> >> Also, if you store your JSon into HBase, you need to add the key size
>>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>>> >>
>>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
>>>> to
>>>> >> store it. Without including the key size. Even with a replication
>>>> >> factor set to 5 you don't have the space.
>>>> >>
>>>> >> Now, you can add some compression, but even with a lucky factor of
>>>> 50%
>>>> >> you still don't have the space. You will need something like 90%
>>>> >> compression factor to be able to store this data in your cluster.
>>>> >>
>>>> >> A 1T drive is now less than $100... So you might think about
>>>> replacing
>>>> >> you 20 GB drives by something bigger.
>>>> >> to reply to your last question, for your data here, you will need AT
>>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go
>>>> under
>>>> >> 500GB.
>>>> >>
>>>> >> IMHO
>>>> >>
>>>> >> JM
>>>> >>
>>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>>> >> > Hello,
>>>> >> >
>>>> >> > I was estimating how much disk space do I need for my cluster.
>>>> >> >
>>>> >> > I have 24 million JSON documents approx. 5kb each
>>>> >> > the Json is to be stored into HBASE with some identifying data in
>>>> >> coloumns
>>>> >> > and I also want to store the Json for later retrieval based on the
>>>> Id
>>>> >> data
>>>> >> > as keys in Hbase.
>>>> >> > I have my HDFS replication set to 3
>>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>>> approx
>>>> >> > 11
>>>> >> GB
>>>> >> > is available for use on my 20 GB node.
>>>> >> >
>>>> >> > I have no idea, if I have not enabled Hbase replication, is the
>>>> HDFS
>>>> >> > replication enough to keep the data safe and redundant.
>>>> >> > How much total disk space I will need for the storage of the data.
>>>> >> >
>>>> >> > Please help me estimate this.
>>>> >> >
>>>> >> > Thank you so much.
>>>> >> >
>>>> >> > --
>>>> >> > Regards,
>>>> >> > Ouch Whisper
>>>> >> > 010101010101
>>>> >> >
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Ouch Whisper
>>>> > 010101010101
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>>
>


-- 
Regards,
Ouch Whisper
010101010101

Re: Estimating disk space requirements

Posted by Ted Dunning <td...@maprtech.com>.
It is usually better not to subdivide nodes into virtual nodes.  You will
generally get better performance from the original node because you only
pay for the OS once and because your disk I/O will be scheduled better.

If you look at EC2 pricing, however, the spot market often has arbitrage
opportunities where one size node is absurdly cheap relative to others.  In
that case, it pays to scale the individual nodes up or down.

The only reasonable reason to split nodes to very small levels is for
testing and training.
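
To illustrate the spot-market point, a small sketch that normalizes hourly
price by resources to find the odd-one-out (the instance names and prices
below are made-up placeholders, not real EC2 quotes):

    # Hypothetical spot prices: instance -> (vCPUs, memory in GB, $/hour).
    spot = {
        "small":  (2,   8, 0.030),
        "medium": (4,  16, 0.048),
        "large":  (8,  32, 0.190),   # relatively expensive at the moment
    }

    def cost_per_unit(vcpus, mem_gb, price):
        # Crude normalization: one "unit" = 1 vCPU + 4 GB RAM.
        units = min(vcpus, mem_gb / 4)
        return price / units

    best = min(spot, key=lambda name: cost_per_unit(*spot[name]))
    for name, (c, m, p) in spot.items():
        print(f"{name}: ${cost_per_unit(c, m, p):.4f} per unit/hour")
    print("cheapest right now:", best)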

On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper <ou...@gmail.com>wrote:

> Thnx for the reply Ted,
>
> You can find 40 GB disks when u make virtual nodes on a cloud like
> Rackspace ;-)
>
> About the os partitions I did not exactly understand what you meant.
> I have made a server on the cloud.. And I just installed and configured
> hadoop and hbase in the /use/local folder.
> And I am pretty sure it does not have a separate partition for root.
>
> Please help me explain what u meant and what else precautions should I
> take.
>
> Thanks,
>
> Regards,
> Ouch Whisper
> 01010101010
> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>> Where do you find 40gb disks now a days?
>>
>> Normally your performance is going to be better with more space but your
>> network may be your limiting factor for some computations.  That could give
>> you some paradoxical scaling.  Hbase will rarely show this behavior.
>>
>> Keep in mind you also want to allow for an os partition. Current standard
>> practice is to reserve as much as 100 GB for that partition but in your
>> case 10gb better:-)
>>
>> Note that if you account for this, the node counts don't scale as simply.
>>  The overhead of these os partitions goes up with number of nodes.
>>
>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>> wrote:
>>
>> If we look at it with performance in mind,
>> is it better to have 20 Nodes with 40 GB HDD
>> or is it better to have 10 Nodes with 80 GB HDD?
>>
>> they are connected on a gigabit LAN
>>
>> Thnx
>>
>>
>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org> wrote:
>>
>>> 20 nodes with 40 GB will do the work.
>>>
>>> After that you will have to consider performances based on your access
>>> pattern. But that's another story.
>>>
>>> JM
>>>
>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> > Thank you for the replies,
>>> >
>>> > So I take it that I should have atleast 800 GB on total free space on
>>> > HDFS.. (combined free space of all the nodes connected to the
>>> cluster). So
>>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>>> Will
>>> > this be enough for the storage?
>>> > Please confirm.
>>> >
>>> > Thanking You,
>>> > Regards,
>>> > Panshul.
>>> >
>>> >
>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>> > jean-marc@spaggiari.org> wrote:
>>> >
>>> >> Hi Panshul,
>>> >>
>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>> >> replication factor.
>>> >>
>>> >> Also, if you store your JSon into HBase, you need to add the key size
>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>> >>
>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>>> >> store it. Without including the key size. Even with a replication
>>> >> factor set to 5 you don't have the space.
>>> >>
>>> >> Now, you can add some compression, but even with a lucky factor of 50%
>>> >> you still don't have the space. You will need something like 90%
>>> >> compression factor to be able to store this data in your cluster.
>>> >>
>>> >> A 1T drive is now less than $100... So you might think about replacing
>>> >> you 20 GB drives by something bigger.
>>> >> to reply to your last question, for your data here, you will need AT
>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>>> >> 500GB.
>>> >>
>>> >> IMHO
>>> >>
>>> >> JM
>>> >>
>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> >> > Hello,
>>> >> >
>>> >> > I was estimating how much disk space do I need for my cluster.
>>> >> >
>>> >> > I have 24 million JSON documents approx. 5kb each
>>> >> > the Json is to be stored into HBASE with some identifying data in
>>> >> coloumns
>>> >> > and I also want to store the Json for later retrieval based on the
>>> Id
>>> >> data
>>> >> > as keys in Hbase.
>>> >> > I have my HDFS replication set to 3
>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>> approx
>>> >> > 11
>>> >> GB
>>> >> > is available for use on my 20 GB node.
>>> >> >
>>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>>> >> > replication enough to keep the data safe and redundant.
>>> >> > How much total disk space I will need for the storage of the data.
>>> >> >
>>> >> > Please help me estimate this.
>>> >> >
>>> >> > Thank you so much.
>>> >> >
>>> >> > --
>>> >> > Regards,
>>> >> > Ouch Whisper
>>> >> > 010101010101
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>> >
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>>

Re: Estimating disk space requirements

Posted by Mohammad Tariq <do...@gmail.com>.
You can attach a separate disk to your instance (for example an
EBS volume in the case of AWS) and store only Hadoop-related data
on it, keeping another disk for the OS.
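
A quick way to sanity-check that the Hadoop data directory really sits on
its own disk rather than the root partition (a minimal sketch; /data/hadoop
is just an assumed mount point, substitute whatever your data dir is):

    import os
    import shutil

    data_dir = "/data/hadoop"   # assumed mount point of the attached data disk
    root_dir = "/"

    # If both paths live on the same device, the data dir is NOT on a separate disk.
    same_disk = os.stat(data_dir).st_dev == os.stat(root_dir).st_dev
    print("data dir shares the root disk:", same_disk)

    total, used, free = shutil.disk_usage(data_dir)
    print(f"data disk: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")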

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Sat, Jan 19, 2013 at 4:00 AM, Panshul Whisper <ou...@gmail.com>wrote:

> Thnx for the reply Ted,
>
> You can find 40 GB disks when u make virtual nodes on a cloud like
> Rackspace ;-)
>
> About the os partitions I did not exactly understand what you meant.
> I have made a server on the cloud.. And I just installed and configured
> hadoop and hbase in the /use/local folder.
> And I am pretty sure it does not have a separate partition for root.
>
> Please help me explain what u meant and what else precautions should I
> take.
>
> Thanks,
>
> Regards,
> Ouch Whisper
> 01010101010
> On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:
>
>> Where do you find 40gb disks now a days?
>>
>> Normally your performance is going to be better with more space but your
>> network may be your limiting factor for some computations.  That could give
>> you some paradoxical scaling.  Hbase will rarely show this behavior.
>>
>> Keep in mind you also want to allow for an os partition. Current standard
>> practice is to reserve as much as 100 GB for that partition but in your
>> case 10gb better:-)
>>
>> Note that if you account for this, the node counts don't scale as simply.
>>  The overhead of these os partitions goes up with number of nodes.
>>
>> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
>> wrote:
>>
>> If we look at it with performance in mind,
>> is it better to have 20 Nodes with 40 GB HDD
>> or is it better to have 10 Nodes with 80 GB HDD?
>>
>> they are connected on a gigabit LAN
>>
>> Thnx
>>
>>
>> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org> wrote:
>>
>>> 20 nodes with 40 GB will do the work.
>>>
>>> After that you will have to consider performances based on your access
>>> pattern. But that's another story.
>>>
>>> JM
>>>
>>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> > Thank you for the replies,
>>> >
>>> > So I take it that I should have atleast 800 GB on total free space on
>>> > HDFS.. (combined free space of all the nodes connected to the
>>> cluster). So
>>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>>> Will
>>> > this be enough for the storage?
>>> > Please confirm.
>>> >
>>> > Thanking You,
>>> > Regards,
>>> > Panshul.
>>> >
>>> >
>>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>>> > jean-marc@spaggiari.org> wrote:
>>> >
>>> >> Hi Panshul,
>>> >>
>>> >> If you have 20 GB with a replication factor set to 3, you have only
>>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>>> >> replication factor.
>>> >>
>>> >> Also, if you store your JSon into HBase, you need to add the key size
>>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>> >>
>>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>>> >> store it. Without including the key size. Even with a replication
>>> >> factor set to 5 you don't have the space.
>>> >>
>>> >> Now, you can add some compression, but even with a lucky factor of 50%
>>> >> you still don't have the space. You will need something like 90%
>>> >> compression factor to be able to store this data in your cluster.
>>> >>
>>> >> A 1T drive is now less than $100... So you might think about replacing
>>> >> you 20 GB drives by something bigger.
>>> >> to reply to your last question, for your data here, you will need AT
>>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>>> >> 500GB.
>>> >>
>>> >> IMHO
>>> >>
>>> >> JM
>>> >>
>>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>>> >> > Hello,
>>> >> >
>>> >> > I was estimating how much disk space do I need for my cluster.
>>> >> >
>>> >> > I have 24 million JSON documents approx. 5kb each
>>> >> > the Json is to be stored into HBASE with some identifying data in
>>> >> coloumns
>>> >> > and I also want to store the Json for later retrieval based on the
>>> Id
>>> >> data
>>> >> > as keys in Hbase.
>>> >> > I have my HDFS replication set to 3
>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>> approx
>>> >> > 11
>>> >> GB
>>> >> > is available for use on my 20 GB node.
>>> >> >
>>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>>> >> > replication enough to keep the data safe and redundant.
>>> >> > How much total disk space I will need for the storage of the data.
>>> >> >
>>> >> > Please help me estimate this.
>>> >> >
>>> >> > Thank you so much.
>>> >> >
>>> >> > --
>>> >> > Regards,
>>> >> > Ouch Whisper
>>> >> > 010101010101
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>> >
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>>

>>> >> coloumns
>>> >> > and I also want to store the Json for later retrieval based on the
>>> Id
>>> >> data
>>> >> > as keys in Hbase.
>>> >> > I have my HDFS replication set to 3
>>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>>> approx
>>> >> > 11
>>> >> GB
>>> >> > is available for use on my 20 GB node.
>>> >> >
>>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>>> >> > replication enough to keep the data safe and redundant.
>>> >> > How much total disk space I will need for the storage of the data.
>>> >> >
>>> >> > Please help me estimate this.
>>> >> >
>>> >> > Thank you so much.
>>> >> >
>>> >> > --
>>> >> > Regards,
>>> >> > Ouch Whisper
>>> >> > 010101010101
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>> >
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>>

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
Thanks for the reply, Ted,

You can find 40 GB disks when you create virtual nodes on a cloud like
Rackspace ;-)

About the OS partitions, I did not exactly understand what you meant.
I have set up a server on the cloud, and I just installed and configured
Hadoop and HBase in the /usr/local folder.
And I am pretty sure it does not have a separate partition for root.

Please help me understand what you meant and what other precautions I should take.

Thanks,

Regards,
Ouch Whisper
01010101010
On Jan 18, 2013 11:11 PM, "Ted Dunning" <td...@maprtech.com> wrote:

> Where do you find 40gb disks now a days?
>
> Normally your performance is going to be better with more space but your
> network may be your limiting factor for some computations.  That could give
> you some paradoxical scaling.  Hbase will rarely show this behavior.
>
> Keep in mind you also want to allow for an os partition. Current standard
> practice is to reserve as much as 100 GB for that partition but in your
> case 10gb better:-)
>
> Note that if you account for this, the node counts don't scale as simply.
>  The overhead of these os partitions goes up with number of nodes.
>
> On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com>
> wrote:
>
> If we look at it with performance in mind,
> is it better to have 20 Nodes with 40 GB HDD
> or is it better to have 10 Nodes with 80 GB HDD?
>
> they are connected on a gigabit LAN
>
> Thnx
>
>
> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> 20 nodes with 40 GB will do the work.
>>
>> After that you will have to consider performances based on your access
>> pattern. But that's another story.
>>
>> JM
>>
>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> > Thank you for the replies,
>> >
>> > So I take it that I should have atleast 800 GB on total free space on
>> > HDFS.. (combined free space of all the nodes connected to the cluster).
>> So
>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>> Will
>> > this be enough for the storage?
>> > Please confirm.
>> >
>> > Thanking You,
>> > Regards,
>> > Panshul.
>> >
>> >
>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> >> Hi Panshul,
>> >>
>> >> If you have 20 GB with a replication factor set to 3, you have only
>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>> >> replication factor.
>> >>
>> >> Also, if you store your JSon into HBase, you need to add the key size
>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>> >>
>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>> >> store it. Without including the key size. Even with a replication
>> >> factor set to 5 you don't have the space.
>> >>
>> >> Now, you can add some compression, but even with a lucky factor of 50%
>> >> you still don't have the space. You will need something like 90%
>> >> compression factor to be able to store this data in your cluster.
>> >>
>> >> A 1T drive is now less than $100... So you might think about replacing
>> >> you 20 GB drives by something bigger.
>> >> to reply to your last question, for your data here, you will need AT
>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>> >> 500GB.
>> >>
>> >> IMHO
>> >>
>> >> JM
>> >>
>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> >> > Hello,
>> >> >
>> >> > I was estimating how much disk space do I need for my cluster.
>> >> >
>> >> > I have 24 million JSON documents approx. 5kb each
>> >> > the Json is to be stored into HBASE with some identifying data in
>> >> coloumns
>> >> > and I also want to store the Json for later retrieval based on the Id
>> >> data
>> >> > as keys in Hbase.
>> >> > I have my HDFS replication set to 3
>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx
>> >> > 11
>> >> GB
>> >> > is available for use on my 20 GB node.
>> >> >
>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>> >> > replication enough to keep the data safe and redundant.
>> >> > How much total disk space I will need for the storage of the data.
>> >> >
>> >> > Please help me estimate this.
>> >> >
>> >> > Thank you so much.
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Ouch Whisper
>> >> > 010101010101
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>> >
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>
>

Re: Estimating disk space requirements

Posted by Ted Dunning <td...@maprtech.com>.
Where do you find 40 GB disks nowadays?

Normally your performance is going to be better with more space, but your network may be your limiting factor for some computations. That could give you some paradoxical scaling. HBase will rarely show this behavior.

Keep in mind you also want to allow for an OS partition. Current standard practice is to reserve as much as 100 GB for that partition, but in your case 10 GB is better :-)

Note that if you account for this, the node counts don't scale as simply: the overhead of these OS partitions goes up with the number of nodes.
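
As a rough back-of-envelope sketch of that effect (the 10 GB-per-node OS
reservation and the replication factor of 3 below are assumptions for
illustration, not measured numbers):

def usable_hdfs_gb(nodes, disk_gb, os_gb=10, replication=3):
    # Capacity left after carving an OS partition out of every node,
    # divided by the HDFS replication factor.
    raw_gb = nodes * (disk_gb - os_gb)
    return raw_gb / float(replication)

print(usable_hdfs_gb(20, 40))  # 20 x 30 GB raw = 600 GB -> 200.0 GB usable
print(usable_hdfs_gb(10, 80))  # 10 x 70 GB raw = 700 GB -> ~233.3 GB usable

Both layouts start from the same 800 GB of raw disk, but the 20-node
layout loses 200 GB to OS partitions versus 100 GB for the 10-node one.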

On Jan 18, 2013, at 8:55 AM, Panshul Whisper <ou...@gmail.com> wrote:

> If we look at it with performance in mind, 
> is it better to have 20 Nodes with 40 GB HDD
> or is it better to have 10 Nodes with 80 GB HDD?
> 
> they are connected on a gigabit LAN
> 
> Thnx
> 
> 
> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:
> 20 nodes with 40 GB will do the work.
> 
> After that you will have to consider performances based on your access
> pattern. But that's another story.
> 
> JM
> 
> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> > Thank you for the replies,
> >
> > So I take it that I should have atleast 800 GB on total free space on
> > HDFS.. (combined free space of all the nodes connected to the cluster). So
> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will
> > this be enough for the storage?
> > Please confirm.
> >
> > Thanking You,
> > Regards,
> > Panshul.
> >
> >
> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> Hi Panshul,
> >>
> >> If you have 20 GB with a replication factor set to 3, you have only
> >> 6.6GB available, not 11GB. You need to divide the total space by the
> >> replication factor.
> >>
> >> Also, if you store your JSon into HBase, you need to add the key size
> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
> >>
> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
> >> store it. Without including the key size. Even with a replication
> >> factor set to 5 you don't have the space.
> >>
> >> Now, you can add some compression, but even with a lucky factor of 50%
> >> you still don't have the space. You will need something like 90%
> >> compression factor to be able to store this data in your cluster.
> >>
> >> A 1T drive is now less than $100... So you might think about replacing
> >> you 20 GB drives by something bigger.
> >> to reply to your last question, for your data here, you will need AT
> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
> >> 500GB.
> >>
> >> IMHO
> >>
> >> JM
> >>
> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> >> > Hello,
> >> >
> >> > I was estimating how much disk space do I need for my cluster.
> >> >
> >> > I have 24 million JSON documents approx. 5kb each
> >> > the Json is to be stored into HBASE with some identifying data in
> >> coloumns
> >> > and I also want to store the Json for later retrieval based on the Id
> >> data
> >> > as keys in Hbase.
> >> > I have my HDFS replication set to 3
> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx
> >> > 11
> >> GB
> >> > is available for use on my 20 GB node.
> >> >
> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
> >> > replication enough to keep the data safe and redundant.
> >> > How much total disk space I will need for the storage of the data.
> >> >
> >> > Please help me estimate this.
> >> >
> >> > Thank you so much.
> >> >
> >> > --
> >> > Regards,
> >> > Ouch Whisper
> >> > 010101010101
> >> >
> >>
> >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
> 
> 
> 
> -- 
> Regards,
> Ouch Whisper
> 010101010101

Re: Estimating disk space requirements

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
It all depends on what you want to do with this data and on the power of
each single node. There is no one-size-fits-all rule.

The more nodes you have, the more CPU power you will have to process
the data... But if your 80 GB boxes' CPUs are faster than your 40 GB
boxes' CPUs, maybe you should take the 80 GB boxes instead.

If you want to get better advice from the list, you will need to
better define your needs and the nodes you can have.

JM

2013/1/18, Panshul Whisper <ou...@gmail.com>:
> If we look at it with performance in mind,
> is it better to have 20 Nodes with 40 GB HDD
> or is it better to have 10 Nodes with 80 GB HDD?
>
> they are connected on a gigabit LAN
>
> Thnx
>
>
> On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> 20 nodes with 40 GB will do the work.
>>
>> After that you will have to consider performances based on your access
>> pattern. But that's another story.
>>
>> JM
>>
>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> > Thank you for the replies,
>> >
>> > So I take it that I should have atleast 800 GB on total free space on
>> > HDFS.. (combined free space of all the nodes connected to the cluster).
>> So
>> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
>> Will
>> > this be enough for the storage?
>> > Please confirm.
>> >
>> > Thanking You,
>> > Regards,
>> > Panshul.
>> >
>> >
>> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> >> Hi Panshul,
>> >>
>> >> If you have 20 GB with a replication factor set to 3, you have only
>> >> 6.6GB available, not 11GB. You need to divide the total space by the
>> >> replication factor.
>> >>
>> >> Also, if you store your JSon into HBase, you need to add the key size
>> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>> >>
>> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>> >> store it. Without including the key size. Even with a replication
>> >> factor set to 5 you don't have the space.
>> >>
>> >> Now, you can add some compression, but even with a lucky factor of 50%
>> >> you still don't have the space. You will need something like 90%
>> >> compression factor to be able to store this data in your cluster.
>> >>
>> >> A 1T drive is now less than $100... So you might think about replacing
>> >> you 20 GB drives by something bigger.
>> >> to reply to your last question, for your data here, you will need AT
>> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>> >> 500GB.
>> >>
>> >> IMHO
>> >>
>> >> JM
>> >>
>> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> >> > Hello,
>> >> >
>> >> > I was estimating how much disk space do I need for my cluster.
>> >> >
>> >> > I have 24 million JSON documents approx. 5kb each
>> >> > the Json is to be stored into HBASE with some identifying data in
>> >> coloumns
>> >> > and I also want to store the Json for later retrieval based on the
>> >> > Id
>> >> data
>> >> > as keys in Hbase.
>> >> > I have my HDFS replication set to 3
>> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so
>> >> > approx
>> >> > 11
>> >> GB
>> >> > is available for use on my 20 GB node.
>> >> >
>> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>> >> > replication enough to keep the data safe and redundant.
>> >> > How much total disk space I will need for the storage of the data.
>> >> >
>> >> > Please help me estimate this.
>> >> >
>> >> > Thank you so much.
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Ouch Whisper
>> >> > 010101010101
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>> >
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
If we look at it with performance in mind,
is it better to have 20 nodes with 40 GB HDDs
or 10 nodes with 80 GB HDDs?

They are connected on a gigabit LAN.

Thnx


On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> 20 nodes with 40 GB will do the work.
>
> After that you will have to consider performances based on your access
> pattern. But that's another story.
>
> JM
>
> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> > Thank you for the replies,
> >
> > So I take it that I should have atleast 800 GB on total free space on
> > HDFS.. (combined free space of all the nodes connected to the cluster).
> So
> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
> Will
> > this be enough for the storage?
> > Please confirm.
> >
> > Thanking You,
> > Regards,
> > Panshul.
> >
> >
> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> Hi Panshul,
> >>
> >> If you have 20 GB with a replication factor set to 3, you have only
> >> 6.6GB available, not 11GB. You need to divide the total space by the
> >> replication factor.
> >>
> >> Also, if you store your JSon into HBase, you need to add the key size
> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
> >>
> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
> >> store it. Without including the key size. Even with a replication
> >> factor set to 5 you don't have the space.
> >>
> >> Now, you can add some compression, but even with a lucky factor of 50%
> >> you still don't have the space. You will need something like 90%
> >> compression factor to be able to store this data in your cluster.
> >>
> >> A 1T drive is now less than $100... So you might think about replacing
> >> you 20 GB drives by something bigger.
> >> to reply to your last question, for your data here, you will need AT
> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
> >> 500GB.
> >>
> >> IMHO
> >>
> >> JM
> >>
> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> >> > Hello,
> >> >
> >> > I was estimating how much disk space do I need for my cluster.
> >> >
> >> > I have 24 million JSON documents approx. 5kb each
> >> > the Json is to be stored into HBASE with some identifying data in
> >> coloumns
> >> > and I also want to store the Json for later retrieval based on the Id
> >> data
> >> > as keys in Hbase.
> >> > I have my HDFS replication set to 3
> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx
> >> > 11
> >> GB
> >> > is available for use on my 20 GB node.
> >> >
> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
> >> > replication enough to keep the data safe and redundant.
> >> > How much total disk space I will need for the storage of the data.
> >> >
> >> > Please help me estimate this.
> >> >
> >> > Thank you so much.
> >> >
> >> > --
> >> > Regards,
> >> > Ouch Whisper
> >> > 010101010101
> >> >
> >>
> >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>



-- 
Regards,
Ouch Whisper
010101010101

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
If we look at it with performance in mind,
is it better to have 20 Nodes with 40 GB HDD
or is it better to have 10 Nodes with 80 GB HDD?

they are connected on a gigabit LAN

Thnx


On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> 20 nodes with 40 GB will do the work.
>
> After that you will have to consider performances based on your access
> pattern. But that's another story.
>
> JM
>
> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> > Thank you for the replies,
> >
> > So I take it that I should have atleast 800 GB on total free space on
> > HDFS.. (combined free space of all the nodes connected to the cluster).
> So
> > I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
> Will
> > this be enough for the storage?
> > Please confirm.
> >
> > Thanking You,
> > Regards,
> > Panshul.
> >
> >
> > On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> Hi Panshul,
> >>
> >> If you have 20 GB with a replication factor set to 3, you have only
> >> 6.6GB available, not 11GB. You need to divide the total space by the
> >> replication factor.
> >>
> >> Also, if you store your JSon into HBase, you need to add the key size
> >> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
> >>
> >> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
> >> store it. Without including the key size. Even with a replication
> >> factor set to 5 you don't have the space.
> >>
> >> Now, you can add some compression, but even with a lucky factor of 50%
> >> you still don't have the space. You will need something like 90%
> >> compression factor to be able to store this data in your cluster.
> >>
> >> A 1T drive is now less than $100... So you might think about replacing
> >> you 20 GB drives by something bigger.
> >> to reply to your last question, for your data here, you will need AT
> >> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
> >> 500GB.
> >>
> >> IMHO
> >>
> >> JM
> >>
> >> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> >> > Hello,
> >> >
> >> > I was estimating how much disk space do I need for my cluster.
> >> >
> >> > I have 24 million JSON documents approx. 5kb each
> >> > the Json is to be stored into HBASE with some identifying data in
> >> coloumns
> >> > and I also want to store the Json for later retrieval based on the Id
> >> data
> >> > as keys in Hbase.
> >> > I have my HDFS replication set to 3
> >> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx
> >> > 11
> >> GB
> >> > is available for use on my 20 GB node.
> >> >
> >> > I have no idea, if I have not enabled Hbase replication, is the HDFS
> >> > replication enough to keep the data safe and redundant.
> >> > How much total disk space I will need for the storage of the data.
> >> >
> >> > Please help me estimate this.
> >> >
> >> > Thank you so much.
> >> >
> >> > --
> >> > Regards,
> >> > Ouch Whisper
> >> > 010101010101
> >> >
> >>
> >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>



-- 
Regards,
Ouch Whisper
010101010101

Re: Estimating disk space requirements

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
20 nodes with 40 GB each will do the job.

After that you will have to consider performance based on your access
pattern. But that's another story.

JM

2013/1/18, Panshul Whisper <ou...@gmail.com>:
> Thank you for the replies,
>
> So I take it that I should have atleast 800 GB on total free space on
> HDFS.. (combined free space of all the nodes connected to the cluster). So
> I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will
> this be enough for the storage?
> Please confirm.
>
> Thanking You,
> Regards,
> Panshul.
>
>
> On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Panshul,
>>
>> If you have 20 GB with a replication factor set to 3, you have only
>> 6.6GB available, not 11GB. You need to divide the total space by the
>> replication factor.
>>
>> Also, if you store your JSon into HBase, you need to add the key size
>> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>>
>> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
>> store it. Without including the key size. Even with a replication
>> factor set to 5 you don't have the space.
>>
>> Now, you can add some compression, but even with a lucky factor of 50%
>> you still don't have the space. You will need something like 90%
>> compression factor to be able to store this data in your cluster.
>>
>> A 1T drive is now less than $100... So you might think about replacing
>> you 20 GB drives by something bigger.
>> to reply to your last question, for your data here, you will need AT
>> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
>> 500GB.
>>
>> IMHO
>>
>> JM
>>
>> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
>> > Hello,
>> >
>> > I was estimating how much disk space do I need for my cluster.
>> >
>> > I have 24 million JSON documents approx. 5kb each
>> > the Json is to be stored into HBASE with some identifying data in
>> coloumns
>> > and I also want to store the Json for later retrieval based on the Id
>> data
>> > as keys in Hbase.
>> > I have my HDFS replication set to 3
>> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx
>> > 11
>> GB
>> > is available for use on my 20 GB node.
>> >
>> > I have no idea, if I have not enabled Hbase replication, is the HDFS
>> > replication enough to keep the data safe and redundant.
>> > How much total disk space I will need for the storage of the data.
>> >
>> > Please help me estimate this.
>> >
>> > Thank you so much.
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>> >
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: Estimating disk space requirements

Posted by Panshul Whisper <ou...@gmail.com>.
Thank you for the replies.

So I take it that I should have at least 800 GB of total free space on
HDFS (the combined free space of all the nodes connected to the cluster). So
I can connect 20 nodes, each with a 40 GB HDD, to my cluster. Will
this be enough for the storage?
Please confirm.

Thanking You,
Regards,
Panshul.


On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Panshul,
>
> If you have 20 GB with a replication factor set to 3, you have only
> 6.6GB available, not 11GB. You need to divide the total space by the
> replication factor.
>
> Also, if you store your JSon into HBase, you need to add the key size
> to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
>
> So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
> store it. Without including the key size. Even with a replication
> factor set to 5 you don't have the space.
>
> Now, you can add some compression, but even with a lucky factor of 50%
> you still don't have the space. You will need something like 90%
> compression factor to be able to store this data in your cluster.
>
> A 1T drive is now less than $100... So you might think about replacing
> you 20 GB drives by something bigger.
> to reply to your last question, for your data here, you will need AT
> LEAST 350GB overall storage. But that's a bare minimum. Don't go under
> 500GB.
>
> IMHO
>
> JM
>
> 2013/1/18, Panshul Whisper <ou...@gmail.com>:
> > Hello,
> >
> > I was estimating how much disk space do I need for my cluster.
> >
> > I have 24 million JSON documents approx. 5kb each
> > the Json is to be stored into HBASE with some identifying data in
> coloumns
> > and I also want to store the Json for later retrieval based on the Id
> data
> > as keys in Hbase.
> > I have my HDFS replication set to 3
> > each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11
> GB
> > is available for use on my 20 GB node.
> >
> > I have no idea, if I have not enabled Hbase replication, is the HDFS
> > replication enough to keep the data safe and redundant.
> > How much total disk space I will need for the storage of the data.
> >
> > Please help me estimate this.
> >
> > Thank you so much.
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>



-- 
Regards,
Ouch Whisper
010101010101
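
To check whether the cluster actually exposes that much space once the nodes
are attached, the NameNode can report the combined configured capacity and
the remaining DFS space, and the data already stored under the HBase root
directory can be inspected directly. A minimal sketch with the stock Hadoop
shell (assuming the default hbase.rootdir of /hbase; adjust the path to your
own setup):

  # combined configured capacity, DFS used and DFS remaining across the datanodes
  hadoop dfsadmin -report

  # total size and actual block replication of everything under the HBase root dir
  hadoop fsck /hbase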

Re: Estimating disk space requirements

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Panshul,

If you have 20 GB with a replication factor set to 3, you have only
6.6 GB available, not 11 GB. You need to divide the total space by the
replication factor.

Also, if you store your JSON into HBase, you need to add the key size
to it. Whether your key is 4 bytes or 1024 bytes, it makes a difference.

So roughly, 24 000 000 * 5 * 1024 bytes = ~114 GB. You don't have the space
to store it, and that is without including the key size. Even with a
replication factor set to 5 you don't have the space.

Now, you can add some compression, but even with a lucky factor of 50%
you still don't have the space. You will need something like a 90%
compression factor to be able to store this data in your cluster.

A 1 TB drive is now less than $100, so you might think about replacing
your 20 GB drives with something bigger.
To reply to your last question: for your data here, you will need AT
LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under
500 GB.

IMHO

JM

2013/1/18, Panshul Whisper <ou...@gmail.com>:
> Hello,
>
> I was estimating how much disk space do I need for my cluster.
>
> I have 24 million JSON documents approx. 5kb each
> the Json is to be stored into HBASE with some identifying data in coloumns
> and I also want to store the Json for later retrieval based on the Id data
> as keys in Hbase.
> I have my HDFS replication set to 3
> each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11 GB
> is available for use on my 20 GB node.
>
> I have no idea, if I have not enabled Hbase replication, is the HDFS
> replication enough to keep the data safe and redundant.
> How much total disk space I will need for the storage of the data.
>
> Please help me estimate this.
>
> Thank you so much.
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>
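
For anyone redoing this estimate with their own numbers, the arithmetic above
fits in a few lines of Python. The key size, the compression ratio and the 25%
working-space margin below are assumptions for illustration, not measured
values:

  #!/usr/bin/env python
  # Back-of-envelope HDFS sizing for the thread's numbers:
  # 24 million JSON documents of ~5 KB, HDFS replication factor 3.

  DOCS = 24 * 1000 * 1000      # number of JSON documents
  DOC_SIZE = 5 * 1024          # ~5 KB per document, in bytes
  KEY_SIZE = 32                # assumed HBase row key / cell overhead, in bytes
  REPLICATION = 3              # dfs.replication
  COMPRESSION = 1.0            # 1.0 = no compression; 0.5 would be a 50% ratio
  WORK_MARGIN = 0.25           # extra room for intermediate processing data

  GB = 1024.0 ** 3
  raw = DOCS * (DOC_SIZE + KEY_SIZE) * COMPRESSION   # logical bytes written
  on_disk = raw * REPLICATION                        # physical bytes after replication
  recommended = on_disk * (1 + WORK_MARGIN)          # plus working space

  print("raw data          : %6.1f GB" % (raw / GB))
  print("after replication : %6.1f GB" % (on_disk / GB))
  print("recommended       : %6.1f GB" % (recommended / GB))

With these assumptions it lands at roughly 115 GB raw, ~345 GB after
replication and ~430 GB with working space, which matches the 350 GB minimum
and the 500 GB recommendation above.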

Re: Estimating disk space requirements

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi,

some comments are inside your message ...


2013/1/18 Panshul Whisper <ou...@gmail.com>

> Hello,
>
> I was estimating how much disk space do I need for my cluster.
>
> I have 24 million JSON documents approx. 5kb each
> the Json is to be stored into HBASE with some identifying data in coloumns
> and I also want to store the Json for later retrieval based on the Id data
> as keys in Hbase.
> I have my HDFS replication set to 3
> each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11
> GB is available for use on my 20 GB node.
>

11 GB is quite small  - or is there a typo?

The amount of raw data is about 115 GB:

  nr of items     : 24 * 1.00E+006 = 24,000,000
  size of an item : 5 * 1.02E+003 bytes ~ 5,120 bytes
  total           : 122,880,000,000 bytes ~ 114.44 GB

(without additional key and metadata)

Depending on the amount of overhead this could be about 200 GB; times 3 for
replication that is 600 GB just for distributed storage.

And then you need some capacity to store intermediate processing data; 20%
to 30% of the processed data is the recommended margin.

So you might prepare a capacity of 1 TB or even more if your dataset grows.


>
>

> I have no idea, if I have not enabled Hbase replication, is the HDFS
> replication enough to keep the data safe and redundant.
>

The replication on the HDFS level is sufficient for keeping the data safe,
no need to replicate the HBase tables separately.


>  How much total disk space I will need for the storage of the data.
>


>
> Please help me estimate this.
>
> Thank you so much.
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Best wishes
Mirko
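
For reference, the HDFS-level replication Mirko is referring to is the
dfs.replication setting in hdfs-site.xml; a minimal sketch of the property
(3 is the common default, and the HFiles and WALs that HBase writes under
hbase.rootdir inherit it):

  <!-- hdfs-site.xml: block replication applied to files written to HDFS,
       including the HFiles and WALs HBase keeps under hbase.rootdir -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>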
