Posted to hdfs-user@hadoop.apache.org by YIMEN YIMGA Gael <ga...@sgcib.com> on 2014/07/09 17:59:46 UTC

Need to evaluate a cluster

Hello,

I estimated the number of nodes for a cluster that will be fed 720GB of data per day.
My estimate came to 367 datanodes after one year, and I am a bit worried by that number of datanodes.
The assumptions I used are the following:


-          Daily supply (feed) : 720GB

-          HDFS replication factor: 3

-          Space reserved on each disk outside HDFS: 30%

-          Size of a disk: 3TB.

I have two questions.

First, I would like to know whether my assumptions are reasonable.
Secondly, could someone help me evaluate this cluster, so I can be sure my result is not excessive, please?

Standing by for your feedback

Warm regards

RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Hi,

To Mirko
The number of HDDs per datanode is 3 (three disks of 1TB to 3TB each).

I calculated the number of nodes using the following formulas (a small sketch of the same calculation follows after this block):

=======

-       Space used on the cluster by the daily feed: <daily feed> * <replication factor> = 720GB * 3 = 2160GB

-       Usable size of a disk for HDFS: <size of a disk> * (1 - <space reserved outside HDFS>) = 3TB * (1 - 30%) = 2.1TB

-       Number of datanodes after one year (without monthly data growth): (<space used per day> * 365) / (1024 * <usable size of a disk>) = (2160GB * 365) / (1024 * 2.1TB) ≈ 367 datanodes
=======
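
Here is a minimal Python sketch of that calculation (the variable names are for illustration only). Note that the last step divides one year of replicated feed by the usable capacity of a single disk, so the result really counts disks rather than whole datanodes:

=======
# Sketch of the sizing above, using the figures quoted in this thread
daily_feed_gb  = 720
replication    = 3
disk_tb        = 3.0
reserved_ratio = 0.30                              # space kept outside HDFS on each disk

used_per_day_gb = daily_feed_gb * replication      # 2160 GB written to the cluster per day
usable_disk_tb  = disk_tb * (1 - reserved_ratio)   # 2.1 TB usable per disk

# one year of feed divided by the usable capacity of ONE disk
datanodes = used_per_day_gb * 365 / (1024 * usable_disk_tb)
print(round(datanodes))                            # -> 367
=======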

To Olivier

No, I assumed the data will not be compressed.
How should I use a compression ratio in my calculation?

Standing by

From: Olivier Renault [mailto:orenault@hortonworks.com]
Sent: Wednesday 9 July 2014 18:51
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster


Is your data already compressed? If it's not you can safely assume a compression ratio of 5.

Olivier
On 9 Jul 2014 17:10, "Mirko Kämpf" <mi...@gmail.com> wrote:
Hello,

If I follow your numbers I see one missing fact: what is the number of HDDs per DataNode?
If you use machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes
per year (0.75 TB per day x 3 for replication x 1.3 for overhead / (nr of HDDs per node x capacity per HDD)).
With 12 HDDs you would only need 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko
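
As a rough check of those figures (a sketch only, using the numbers quoted above):

=======
# Quick check of the per-node estimate quoted above
raw_tb_per_day = 0.75 * 3 * 1.3                    # feed x replication x overhead ~ 2.9 TB/day
print(round(raw_tb_per_day * 365 / (6 * 3.0)))     # 6 HDDs x 3TB per node  -> 59, i.e. about 60 nodes/year
print(round(raw_tb_per_day * 365 / (12 * 3.0)))    # 12 HDDs x 3TB per node -> 30 nodes/year
=======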

Re: Need to evaluate a cluster

Posted by Olivier Renault <or...@hortonworks.com>.
Is your data already compressed? If it's not, you can safely assume a
compression ratio of 5.

Olivier
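
For example, with the 720GB/day feed discussed above, a ratio of 5 would mean roughly 720 / 5 = 144GB/day actually written to HDFS before replication (an illustration only; the real ratio depends on the data and the codec).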

RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
In addition, when I applied the compression factor of 8, I get a daily feed of 87GB/day.
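
Plugging that compressed feed into the per-node estimate from earlier in the thread gives a much smaller cluster (a rough sketch only; it ignores minimum cluster size, compute requirements and data growth):

=======
# Rough re-sizing with the compressed feed (assumptions as earlier in the thread)
compressed_tb_per_day = 0.087                              # ~87 GB/day after compression
yearly_raw_tb = compressed_tb_per_day * 3 * 1.3 * 365      # replication x 1.3 overhead
print(round(yearly_raw_tb))                                # -> 124 TB of raw disk per year
print(round(yearly_raw_tb / (12 * 3.0)))                   # -> 3 nodes of 12 x 3TB per year
=======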

From: YIMEN YIMGA Gael ItecCsySat
Sent: Thursday 10 July 2014 11:11
To: user@hadoop.apache.org
Subject: RE: Need to evaluate a cluster

Thanks for your reply, Mirko.

In my case, I can consider a compression factor of 8, according to the team in charge of it.

The data I'm dealing with is logs only, but many types of logs (printing logs, USB logs, remote access logs, Active Directory logs, database server logs, web server logs, antivirus logs, etc.).
To be precise, only logs are stored in my case. Sometimes we could have CSV files, but no videos or images are considered here.

Any advice for that specific type of data?
What are the reasons to consider servers with 12 HDDs (3TB) per server? Knowing that, I prefer low-cost hardware.
What could be the price of a low-cost server with 12 HDDs (3TB)?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
Sent: Thursday 10 July 2014 11:01
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster

I multiply by 1.3, which means I add 30% of the estimated amount as reserved capacity for intermediate data.
In your case, with approx. 2TB per day, I think datanodes with 1 to 3 disks are not a good idea. You should consider servers with more disks and then add one per week. Start with 10 servers and 12 HDDs (3TB) per server. This allows you to handle approx. 35 TB of raw uncompressed data. You have to evaluate compression in your specific case. It can be high, but also not very high if the raw data is already compressed somehow. What data are you dealing with?
Text, messages, logs, or more binary data like images, mp3 or video formats?
Cheers,
Mirko

2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
Hi,

What does « 1.3 for overhead » mean in this calculation ?

Regards

RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Thank you Mirko, I saw the chapter titled PLANNING A HADOOP CLUSTER.

I’ll take that book.

Re: Need to evaluate a cluster

Posted by Mirko Kämpf <mi...@gmail.com>.
Just request a quote from the leading vendors and also from local vendors. Tell them
about the volume and the access pattern you have in mind and collect the
offerings. Then compare the prices. You should consider space (in the
data center), network architecture, and energy consumption, as well as heat
generation, in a cluster which will handle some PB in the long run.
Have a look at the book:
http://www.amazon.de/Hadoop-Operations-Eric-Sammer/dp/1449327052
Cheers,
Mirko



2014-07-10 11:10 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>
> *************************************************************************
> This message and any attachments (the "message") are confidential,
> intended solely for the addressee(s), and may contain legally privileged
> information.
> Any unauthorised use or dissemination is prohibited. E-mails are
> susceptible to alteration.
> Neither SOCIETE GENERALE nor any of its subsidiaries or affiliates shall
> be liable for the message if altered, changed or
> falsified.
> Please visit http://swapdisclosure.sgcib.com for important information
> with respect to derivative products.
>                               ************
> Ce message et toutes les pieces jointes (ci-apres le "message") sont
> confidentiels et susceptibles de contenir des informations couvertes
> par le secret professionnel.
> Ce message est etabli a l'intention exclusive de ses destinataires. Toute
> utilisation ou diffusion non autorisee est interdite.
> Tout message electronique est susceptible d'alteration.
> La SOCIETE GENERALE et ses filiales declinent toute responsabilite au
> titre de ce message s'il a ete altere, deforme ou falsifie.
> Veuillez consulter le site http://swapdisclosure.sgcib.com afin de
> recueillir d'importantes informations sur les produits derives.
> *************************************************************************
>
>
>
>
>

RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Hi Olivier,

When I say LOW-COST, I mean COMMODITY HARDWARE.

Could you advise, please?

From: Olivier Renault [mailto:orenault@hortonworks.com]
Sent: Thursday 10 July 2014 11:18
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster

Either you spend your money on servers with more disks or your spend your money on cooling / power consumption and potentially building a new DC ;).

A typical server from a tier 1 vendor ( HP, Dell, IBM, Cisco ) should be around 5k euros ( fully loaded with HDD ).

Kind regards,
Olivier

On 10 July 2014 11:10, YIMEN YIMGA Gael <ga...@sgcib.com>> wrote:
Thank for your return Mirko,

In my case, I can consider compression factor of *8 according to the service in charge of it.

Data, I’m dealing with are : logs only. But it’s many types of logs (printing logs, USB logs, Remote access logs, Active Directory logs, database servers logs, Web servers logs, Antivirus logs, etc.)
I precise that in my case it’s only logs that are stored. Sometime we could have CSV files. But no videos or images are considered here.

Any advice according to that specific type of data?
What are the reasons to consider servers with 12 HDD (3TB) per server? Knowing that, I prefer the LOW-COST.
What could be the price of a LOW-COST server with 12HDD (3TB) ?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com<ma...@gmail.com>]
Sent: Thursday 10 July 2014 11:01

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Need to evaluate a cluster

I multiply by 1.3 which means I add 30% of the estimated amount to have reserved capacity for intermediate data.
In your case with approx. 2TB per day I think, data nodes with 1 to 3 discs are not a good idea. You should consider servers with more discs and than add one per week. Start with 10 servers and 12 HDD (3TB) per server. This allows you to handle approx. 35 TB raw uncompressed data. You have to evaluated compression in your special case. It can be high, but also not very high, if raw data is already compressed somehow. What data are you dealing with?
Text, messages, logs or more binary data like images, mp3 oder video formats?
Cheers,
Mirko

2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hi,

What does « 1.3 for overhead » mean in this calculation ?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com<ma...@gmail.com>]
Sent: Wednesday 9 July 2014 18:09

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Need to evaluate a cluster

Hello,

if I follow your numbers I see one missing fact: What is the number of HDDs per DataNode?
Let's assume you use machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes
per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of HDDs per node x capacity per HDD )).
With 12 HDD you would only need 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko

2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hello Dear,

I made an estimation of a number of nodes of a cluster that can be supplied by 720GB of data/day.
My estimation gave me 367 datanodes in a year. I’m a bit afraid by that amount of datanodes.
The assumptions, I used are the followings :


-          Daily supply (feed) : 720GB

-          HDFS replication factor: 3

-          Booked space for each disk outside HDFS: 30%

-          Size of a disk: 3TB.

I have two questions.

First, I would like to know if my assumptions are well taken?
Secondly, could someone help me to evaluate that cluster, to let me be sure that my results are not to excessive, please ?

Standing by for your feedback

Warm regard






--
Olivier Renault
Solution Engineer , Hortonworks
Mobile: +44 7500 933 036
Email: orenault@hortonworks.com<ma...@hortonworks.com>
Website: http://www.hortonworks.com/

Latest From Our Blog: Accenture and Hortonworks Announce Alliance <http://hortonworks.com/blog/accenture-hortonworks-alliance/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>


Re: Need to evaluate a cluster

Posted by Olivier Renault <or...@hortonworks.com>.
Either you spend your money on servers with more disks, or you spend your
money on cooling / power consumption and potentially building a new DC ;).

A typical server from a tier 1 vendor (HP, Dell, IBM, Cisco) should be
around 5k euros (fully loaded with HDDs).

Kind regards,
Olivier

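As a rough back-of-envelope sketch in the same spirit, the indicative 5k-euro figure can be turned into a price per usable terabyte; the 12 x 3TB configuration and the replication / overhead factors below are the assumptions from earlier in this thread, not measured values.

# Rough cost-per-TB illustration using the indicative numbers from this thread.
server_price_eur = 5000.0      # indicative tier-1 server, fully loaded with disks
raw_tb_per_server = 12 * 3.0   # assumed 12 x 3TB disks
replication = 3                # HDFS replication factor
overhead = 1.3                 # reserve for intermediate data

usable_tb_per_server = raw_tb_per_server / (replication * overhead)
print(round(server_price_eur / raw_tb_per_server))     # ~139 EUR per raw TB of disk
print(round(server_price_eur / usable_tb_per_server))  # ~542 EUR per usable TB of input data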

On 10 July 2014 11:10, YIMEN YIMGA Gael <ga...@sgcib.com> wrote:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>
> *************************************************************************
> This message and any attachments (the "message") are confidential,
> intended solely for the addressee(s), and may contain legally privileged
> information.
> Any unauthorised use or dissemination is prohibited. E-mails are
> susceptible to alteration.
> Neither SOCIETE GENERALE nor any of its subsidiaries or affiliates shall
> be liable for the message if altered, changed or
> falsified.
> Please visit http://swapdisclosure.sgcib.com for important information
> with respect to derivative products.
>                               ************
> Ce message et toutes les pieces jointes (ci-apres le "message") sont
> confidentiels et susceptibles de contenir des informations couvertes
> par le secret professionnel.
> Ce message est etabli a l'intention exclusive de ses destinataires. Toute
> utilisation ou diffusion non autorisee est interdite.
> Tout message electronique est susceptible d'alteration.
> La SOCIETE GENERALE et ses filiales declinent toute responsabilite au
> titre de ce message s'il a ete altere, deforme ou falsifie.
> Veuillez consulter le site http://swapdisclosure.sgcib.com afin de
> recueillir d'importantes informations sur les produits derives.
> *************************************************************************
>
>
>
>
>



-- 
  Olivier Renault
 Solution Engineer , Hortonworks
   Mobile: +44 7500 933 036
 Email: orenault@hortonworks.com
 Website: http://www.hortonworks.com/

  Latest From Our Blog:  Accenture and Hortonworks Announce Alliance
<http://hortonworks.com/blog/accenture-hortonworks-alliance/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Need to evaluate a cluster

Posted by Olivier Renault <or...@hortonworks.com>.
Either you spend your money on servers with more disks or your spend your
money on cooling / power consumption and potentially building a new DC ;).

A typical server from a tier 1 vendor ( HP, Dell, IBM, Cisco ) should be
around 5k euros ( fully loaded with HDD ).

Kind regards,
Olivier


On 10 July 2014 11:10, YIMEN YIMGA Gael <ga...@sgcib.com> wrote:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>
> *************************************************************************
> This message and any attachments (the "message") are confidential,
> intended solely for the addressee(s), and may contain legally privileged
> information.
> Any unauthorised use or dissemination is prohibited. E-mails are
> susceptible to alteration.
> Neither SOCIETE GENERALE nor any of its subsidiaries or affiliates shall
> be liable for the message if altered, changed or
> falsified.
> Please visit http://swapdisclosure.sgcib.com for important information
> with respect to derivative products.
>                               ************
> Ce message et toutes les pieces jointes (ci-apres le "message") sont
> confidentiels et susceptibles de contenir des informations couvertes
> par le secret professionnel.
> Ce message est etabli a l'intention exclusive de ses destinataires. Toute
> utilisation ou diffusion non autorisee est interdite.
> Tout message electronique est susceptible d'alteration.
> La SOCIETE GENERALE et ses filiales declinent toute responsabilite au
> titre de ce message s'il a ete altere, deforme ou falsifie.
> Veuillez consulter le site http://swapdisclosure.sgcib.com afin de
> recueillir d'importantes informations sur les produits derives.
> *************************************************************************
>
>
>
>
>



-- 
  Olivier Renault
 Solution Engineer , Hortonworks
   Mobile: +44 7500 933 036
 Email: orenault@hortonworks.com
 Website: http://www.hortonworks.com/

<http://www.twitter.com/hortonworks?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
<http://www.linkedin.com/company/hortonworks?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
<http://www.facebook.com/hortonworks?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>
 [image: photo]
  Latest From Our Blog:  Accenture and Hortonworks Announce Alliance
<http://hortonworks.com/blog/accenture-hortonworks-alliance/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Need to evaluate a cluster

Posted by Mirko Kämpf <mi...@gmail.com>.
Just request a quote from the leading and also local vendors. Tell them
about the volume and the access pattern you have in mind and collect the
offerings. Than you compare the prices. You should consider space (in the
data center) network architecture and energy cosumption as well as heat
generation in such a cluster which handles some PB on the long run.
Have a look into the book:
http://www.amazon.de/Hadoop-Operations-Eric-Sammer/dp/1449327052
Cheers,
Mirko



2014-07-10 11:10 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>
> *************************************************************************
> This message and any attachments (the "message") are confidential,
> intended solely for the addressee(s), and may contain legally privileged
> information.
> Any unauthorised use or dissemination is prohibited. E-mails are
> susceptible to alteration.
> Neither SOCIETE GENERALE nor any of its subsidiaries or affiliates shall
> be liable for the message if altered, changed or
> falsified.
> Please visit http://swapdisclosure.sgcib.com for important information
> with respect to derivative products.
>                               ************
> Ce message et toutes les pieces jointes (ci-apres le "message") sont
> confidentiels et susceptibles de contenir des informations couvertes
> par le secret professionnel.
> Ce message est etabli a l'intention exclusive de ses destinataires. Toute
> utilisation ou diffusion non autorisee est interdite.
> Tout message electronique est susceptible d'alteration.
> La SOCIETE GENERALE et ses filiales declinent toute responsabilite au
> titre de ce message s'il a ete altere, deforme ou falsifie.
> Veuillez consulter le site http://swapdisclosure.sgcib.com afin de
> recueillir d'importantes informations sur les produits derives.
> *************************************************************************
>
>
>
>
>

Re: Need to evaluate a cluster

Posted by Mirko Kämpf <mi...@gmail.com>.
Just request a quote from the leading and also local vendors. Tell them
about the volume and the access pattern you have in mind and collect the
offerings. Than you compare the prices. You should consider space (in the
data center) network architecture and energy cosumption as well as heat
generation in such a cluster which handles some PB on the long run.
Have a look into the book:
http://www.amazon.de/Hadoop-Operations-Eric-Sammer/dp/1449327052
Cheers,
Mirko



2014-07-10 11:10 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>

Re: Need to evaluate a cluster

Posted by Olivier Renault <or...@hortonworks.com>.
Either you spend your money on servers with more disks, or you spend your
money on cooling / power consumption and potentially on building a new DC ;).

A typical server from a tier-1 vendor (HP, Dell, IBM, Cisco) should be
around 5k euros (fully loaded with HDDs).

Kind regards,
Olivier
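
As a back-of-the-envelope sketch only, combining that price with the 10-server
starting point suggested elsewhere in this thread (hardware only, before
network, racks, power and cooling):

servers = 10                 # starting cluster size suggested in this thread
price_per_server_eur = 5000  # ballpark for a tier-1, 12-disk server as above
print(servers * price_per_server_eur)   # 50000 euros of server hardware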


On 10 July 2014 11:10, YIMEN YIMGA Gael <ga...@sgcib.com> wrote:

> Thank for your return Mirko,
>
>
>
> In my case, I can consider *compression factor of *8* according to the
> service in charge of it.
>
>
>
> Data, I’m dealing with are : logs only. But it’s many types of logs
> (printing logs, USB logs, Remote access logs, Active Directory logs,
> database servers logs, Web servers logs, Antivirus logs, etc.)
>
> I precise that in my case it’s only logs that are stored. Sometime we
> could have CSV files. But no videos or images are considered here.
>
>
>
> Any advice according to that specific type of data?
>
> What are the reasons to consider servers with 12 HDD (3TB) per server?
> Knowing that, I prefer the LOW-COST.
>
> What could be the price of a LOW-COST server with 12HDD (3TB) ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Thursday 10 July 2014 11:01
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> I multiply by 1.3 which means I add 30% of the estimated amount to have
> reserved capacity for intermediate data.
>
> In your case with approx. 2TB per day I think, data nodes with 1 to 3
> discs are not a good idea. You should consider servers with more discs and
> than add one per week. Start with 10 servers and 12 HDD (3TB) per server.
> This allows you to handle approx. 35 TB raw uncompressed data. You have to
> evaluated compression in your special case. It can be high, but also not
> very high, if raw data is already compressed somehow. What data are you
> dealing with?
>
> Text, messages, logs or more binary data like images, mp3 oder video
> formats?
>
> Cheers,
> Mirko
>
>
>
> 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>



-- 
  Olivier Renault
 Solution Engineer , Hortonworks
   Mobile: +44 7500 933 036
 Email: orenault@hortonworks.com
 Website: http://www.hortonworks.com/




RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Thank you for your reply, Mirko,

In my case, I can consider a compression factor of 8, according to the team in charge of it.

The data I'm dealing with is logs only, but many types of logs (printing logs, USB logs, remote access logs, Active Directory logs, database server logs, web server logs, antivirus logs, etc.).
To be precise, in my case only logs are stored. Sometimes we could have CSV files, but no videos or images are considered here.

Any advice for that specific type of data?
What are the reasons to consider servers with 12 HDDs (3TB) per server, knowing that I prefer the low-cost option?
What could be the price of a low-cost server with 12 HDDs (3TB)?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
Sent: Thursday 10 July 2014 11:01
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster

I multiply by 1.3 which means I add 30% of the estimated amount to have reserved capacity for intermediate data.
In your case with approx. 2TB per day I think, data nodes with 1 to 3 discs are not a good idea. You should consider servers with more discs and than add one per week. Start with 10 servers and 12 HDD (3TB) per server. This allows you to handle approx. 35 TB raw uncompressed data. You have to evaluated compression in your special case. It can be high, but also not very high, if raw data is already compressed somehow. What data are you dealing with?
Text, messages, logs or more binary data like images, mp3 oder video formats?
Cheers,
Mirko

2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hi,

What does « 1.3 for overhead » mean in this calculation ?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com<ma...@gmail.com>]
Sent: Wednesday 9 July 2014 18:09

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Need to evaluate a cluster

Hello,

if I follow your numbers I see one missing fact: What is the number of HDDs per DataNode?
Let's assume you use machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes
per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of HDDs per node x capacity per HDD )).
With 12 HDD you would only need 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko

2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hello Dear,

I made an estimation of a number of nodes of a cluster that can be supplied by 720GB of data/day.
My estimation gave me 367 datanodes in a year. I’m a bit afraid by that amount of datanodes.
The assumptions, I used are the followings :


-          Daily supply (feed) : 720GB

-          HDFS replication factor: 3

-          Booked space for each disk outside HDFS: 30%

-          Size of a disk: 3TB.

I have two questions.

First, I would like to know if my assumptions are well taken?
Secondly, could someone help me to evaluate that cluster, to let me be sure that my results are not to excessive, please ?

Standing by for your feedback

Warm regard




RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
In addition, when I apply the compression factor of 8, I get a daily feed of 87GB/day.
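
As a rough cross-check, assuming the 720GB/day raw feed from the original
estimate and a flat compression factor of 8:

raw_feed_gb_per_day = 720    # daily raw feed assumed in the original estimate
compression_factor = 8       # figure quoted by the team in charge of compression
print(raw_feed_gb_per_day / compression_factor)   # 90.0 GB/day, same ballpark as the 87GB/day above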

From: YIMEN YIMGA Gael ItecCsySat
Sent: Thursday 10 July 2014 11:11
To: user@hadoop.apache.org
Subject: RE: Need to evaluate a cluster

Thank for your return Mirko,

In my case, I can consider compression factor of *8 according to the service in charge of it.

Data, I’m dealing with are : logs only. But it’s many types of logs (printing logs, USB logs, Remote access logs, Active Directory logs, database servers logs, Web servers logs, Antivirus logs, etc.)
I precise that in my case it’s only logs that are stored. Sometime we could have CSV files. But no videos or images are considered here.

Any advice according to that specific type of data?
What are the reasons to consider servers with 12 HDD (3TB) per server? Knowing that, I prefer the LOW-COST.
What could be the price of a LOW-COST server with 12HDD (3TB) ?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
Sent: Thursday 10 July 2014 11:01
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Need to evaluate a cluster

I multiply by 1.3 which means I add 30% of the estimated amount to have reserved capacity for intermediate data.
In your case with approx. 2TB per day I think, data nodes with 1 to 3 discs are not a good idea. You should consider servers with more discs and than add one per week. Start with 10 servers and 12 HDD (3TB) per server. This allows you to handle approx. 35 TB raw uncompressed data. You have to evaluated compression in your special case. It can be high, but also not very high, if raw data is already compressed somehow. What data are you dealing with?
Text, messages, logs or more binary data like images, mp3 oder video formats?
Cheers,
Mirko

2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hi,

What does « 1.3 for overhead » mean in this calculation ?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com<ma...@gmail.com>]
Sent: Wednesday 9 July 2014 18:09

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Need to evaluate a cluster

Hello,

if I follow your numbers I see one missing fact: What is the number of HDDs per DataNode?
Let's assume you use machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes
per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of HDDs per node x capacity per HDD )).
With 12 HDD you would only need 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko

2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hello Dear,

I made an estimation of a number of nodes of a cluster that can be supplied by 720GB of data/day.
My estimation gave me 367 datanodes in a year. I’m a bit afraid by that amount of datanodes.
The assumptions, I used are the followings :


-          Daily supply (feed) : 720GB

-          HDFS replication factor: 3

-          Booked space for each disk outside HDFS: 30%

-          Size of a disk: 3TB.

I have two questions.

First, I would like to know if my assumptions are well taken?
Secondly, could someone help me to evaluate that cluster, to let me be sure that my results are not to excessive, please ?

Standing by for your feedback

Warm regard




Re: Need to evaluate a cluster

Posted by Mirko Kämpf <mi...@gmail.com>.
I multiply by 1.3, which means I add 30% of the estimated amount to have
reserved capacity for intermediate data.
In your case, with approx. 2TB per day, I think data nodes with 1 to 3 disks
are not a good idea. You should consider servers with more disks and then
add one server per week. Start with 10 servers and 12 HDDs (3TB) per server.
This allows you to handle approx. 35 TB of raw uncompressed data. You have to
evaluate compression in your specific case. It can be high, but also not
very high, if the raw data is already compressed somehow. What data are you
dealing with?
Text, messages, logs, or more binary data like images, mp3 or video
formats?

Cheers,
Mirko
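
A minimal Python sketch of the formula above, for replaying the arithmetic; the
figures are only the assumptions quoted in this thread (720GB/day rounded up to
0.75TB, replication 3, 1.3 overhead, 3TB disks), not a recommendation:

import math

def datanodes_per_year(daily_feed_tb, replication=3, overhead=1.3,
                       hdds_per_node=6, tb_per_hdd=3.0, days=365):
    # yearly raw volume landing on HDFS, including replicas and the
    # 30% reserve kept for intermediate data
    raw_per_year_tb = daily_feed_tb * days * replication * overhead
    capacity_per_node_tb = hdds_per_node * tb_per_hdd
    return math.ceil(raw_per_year_tb / capacity_per_node_tb)

print(datanodes_per_year(0.75, hdds_per_node=6))   # 60 nodes with 6 x 3TB HDDs
print(datanodes_per_year(0.75, hdds_per_node=12))  # 30 nodes with 12 x 3TB HDDs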



2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:

> Hi,
>
>
>
> What does « 1.3 for overhead » mean in this calculation ?
>
>
>
> Regards
>
>
>
> *From:* Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
> *Sent:* Wednesday 9 July 2014 18:09
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Need to evaluate a cluster
>
>
>
> Hello,
>
>
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
>
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
>
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
>
>
> Cheers,
>
> Mirko
>
>
>
> 2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>:
>
> Hello Dear,
>
>
>
> I made an estimation of a number of nodes of a cluster that can be
> supplied by 720GB of data/day.
>
> My estimation gave me *367 datanodes* in a year. I’m a bit afraid by that
> amount of datanodes.
>
> The assumptions, I used are the followings :
>
>
>
> -          Daily supply (feed) : 720GB
>
> -          HDFS replication factor: 3
>
> -          Booked space for each disk outside HDFS: 30%
>
> -          Size of a disk: 3TB.
>
>
>
> I have two questions.
>
>
>
> First, I would like to know if my assumptions are well taken?
>
> Secondly, could someone help me to evaluate that cluster, to let me be
> sure that my results are not to excessive, please ?
>
>
>
> Standing by for your feedback
>
>
>
> Warm regard
>

RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Hi,

What does « 1.3 for overhead » mean in this calculation?

Regards

From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
Sent: Wednesday 9 July 2014 18:09
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster

Hello,

if I follow your numbers I see one missing fact: What is the number of HDDs per DataNode?
Let's assume you use machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes
per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of HDDs per node x capacity per HDD )).
With 12 HDD you would only need 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko

2014-07-09 17:59 GMT+02:00 YIMEN YIMGA Gael <ga...@sgcib.com>>:
Hello Dear,

I made an estimation of a number of nodes of a cluster that can be supplied by 720GB of data/day.
My estimation gave me 367 datanodes in a year. I’m a bit afraid by that amount of datanodes.
The assumptions, I used are the followings :


-          Daily supply (feed) : 720GB

-          HDFS replication factor: 3

-          Booked space for each disk outside HDFS: 30%

-          Size of a disk: 3TB.

I have two questions.

First, I would like to know if my assumptions are well taken?
Secondly, could someone help me to evaluate that cluster, to let me be sure that my results are not to excessive, please ?

Standing by for your feedback

Warm regard



Re: Need to evaluate a cluster

Posted by Olivier Renault <or...@hortonworks.com>.
Is your data already compressed? If it's not, you can safely assume a compression ratio of 5.

Olivier
On 9 Jul 2014 17:10, "Mirko Kämpf" <mi...@gmail.com> wrote:

> Hello,
>
> if I follow your numbers I see one missing fact: *What is the number of
> HDDs per DataNode*?
> Let's assume you use machines with 6 x 3TB HDDs per box, you would need
> about 60 DataNodes
> per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of
> HDDs per node x capacity per HDD )).
> With 12 HDD you would only need 30 servers per year.
> How did you calculate the number of 367 datanodes?
>
> Cheers,
> Mirko
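
Purely as an illustration of Olivier's point (not a figure from the thread): with a 5:1 compression ratio, the daily volume that actually lands on HDFS drops from roughly 720 GB to about 144 GB, and the same back-of-the-envelope arithmetic (0.144 TB x 365 x 3 x 1.3 ~ 205 TB, divided by 18 TB of raw disk per 6-disk node) gives on the order of 12 DataNodes for the first year. The achievable ratio depends entirely on the data and the codec, so treat 5:1 as an assumption to validate against a sample of your own feed.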



Re: Need to evaluate a cluster

Posted by Mirko Kämpf <mi...@gmail.com>.
Hello,

If I follow your numbers, I see one missing fact: *what is the number of
HDDs per DataNode*?
Let's assume you use machines with 6 x 3TB HDDs per box; then you would need about 60 DataNodes
per year (0.75 TB per day x 365 days x 3 for replication x 1.3 for overhead / (nr of HDDs per node x capacity per HDD)).
With 12 HDDs per node you would only need about 30 servers per year.
How did you calculate the number of 367 datanodes?

Cheers,
Mirko



RE: Need to evaluate a cluster

Posted by YIMEN YIMGA Gael <ga...@sgcib.com>.
Hi,

When I said the size of a disk is 3TB, I meant that a datanode should have 3TB of disk space in total (1 to 3 disks of 1TB to 3TB each).

Could you, from your experience, please help me approximate the number of nodes needed for one year?

Regards

From: Oner Ak. [mailto:oak26013@gmail.com]
Sent: Wednesday 9 July 2014 21:31
To: user@hadoop.apache.org
Subject: Re: Need to evaluate a cluster


367 nodes sounded quite high for that amount of data per day. You might need 367 disks, but do your nodes have more than one disk?

You may also take into account the compression factor that you are likely to use for the data on the cluster.

Oner
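
A rough cross-check of these numbers (an editorial illustration, not a statement from the thread): with only about 3 TB of raw disk per DataNode, i.e. roughly 2.1 TB usable after the 30% reservation, one year of feed stores 0.72 TB x 365 x 3 replicas ~ 788 TB, which needs on the order of 375 such nodes. In other words, the 367 figure follows almost entirely from the very small per-node capacity, not from the data volume itself.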


Re: Need to evaluate a cluster

Posted by "Oner Ak." <oa...@gmail.com>.
367 nodes sounded quite high for that amount of data per day. You might
need 367 disks, but do your nodes have more than one disk?

You may also take into account the compression factor that you are likely
to use for the data on the cluster.

Oner
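
To put numbers on that question (again only as an illustration): if the 367 figure is effectively counting 3TB disks rather than servers, spreading those disks across machines with 6 drives each gives about 61 DataNodes, and across 12-drive machines about 31, in line with Mirko's earlier estimate of roughly 60 or 30 per year.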
