Posted to hdfs-user@hadoop.apache.org by Shashidhar Rao <ra...@gmail.com> on 2014/04/14 19:57:01 UTC

Time taken to do a word count on 10 TB data.

Hi,

Can somebody give me a rough estimate, in hours or minutes, of how long a
cluster of, say, 30 nodes would take to run a MapReduce job that performs a
word count on about 10 TB of data, assuming the hardware and the MapReduce
program are tuned optimally?

Just a rough estimate; the data could be 5 TB, 10 TB, or 20 TB. If not word
count, the job could be any comparable analysis of data at that scale.
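For concreteness, the job I have in mind is essentially the stock Hadoop
WordCount pattern, sketched below (class names are illustrative and the
input/output paths come in as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle volume
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}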

Regards
Shashidhar

Re: Time taken to do a word count on 10 TB data.

Posted by Shashidhar Rao <ra...@gmail.com>.
Thanks, Stanley Shi.


On Tue, Apr 15, 2014 at 6:25 AM, Stanley Shi <ss...@gopivotal.com> wrote:

> Rough estimate: since word count requires very little computation, it is
> I/O-bound, so we can estimate based on disk speed.
>
> Assume 10 disks per node at 100 MB/s each, which is about 1 GB/s per node;
> assuming 70% utilization in the mappers, that leaves 700 MB/s per node.
> Across 30 nodes that is roughly 20 GB/s in aggregate, so reading 10 TB
> takes about 500 seconds.
> Adding some MapReduce overhead and the final merge, say 20% overhead, we
> can expect about 10 minutes here.
>
>
> On Tuesday, April 15, 2014, Shashidhar Rao <ra...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Can somebody give me a rough estimate, in hours or minutes, of how long a
>> cluster of, say, 30 nodes would take to run a MapReduce job that performs
>> a word count on about 10 TB of data, assuming the hardware and the
>> MapReduce program are tuned optimally?
>>
>> Just a rough estimate; the data could be 5 TB, 10 TB, or 20 TB. If not
>> word count, the job could be any comparable analysis of data at that
>> scale.
>>
>> Regards
>> Shashidhar
>>
>
>
> --
> Regards,
> *Stanley Shi,*
>
>
>

Re: Time taken to do a word count on 10 TB data.

Posted by Stanley Shi <ss...@gopivotal.com>.
Rough estimate: since word count requires very little computation, it is
I/O-bound, so we can estimate based on disk speed.

Assume 10 disks per node at 100 MB/s each, which is about 1 GB/s per node;
assuming 70% utilization in the mappers, that leaves 700 MB/s per node.
Across 30 nodes that is roughly 20 GB/s in aggregate, so reading 10 TB
takes about 500 seconds.
Adding some MapReduce overhead and the final merge, say 20% overhead, we
can expect about 10 minutes here.
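
Spelled out as a quick sketch (every constant below is one of the
assumptions above, not a measurement):

public class WordCountEstimate {
  public static void main(String[] args) {
    double diskMBps = 100;            // assumed per-disk sequential read, MB/s
    int disksPerNode = 10;            // assumed disks per node
    double mapperUtilization = 0.70;  // fraction of raw bandwidth mappers see
    int nodes = 30;

    // ~700 MB/s effective per node, ~21,000 MB/s (~20 GB/s) cluster-wide.
    double nodeMBps = diskMBps * disksPerNode * mapperUtilization;
    double clusterMBps = nodeMBps * nodes;

    double dataMB = 10_000_000;                 // 10 TB expressed in MB
    double scanSeconds = dataMB / clusterMBps;  // ~480 s, call it ~500 s
    double totalSeconds = scanSeconds * 1.2;    // +20% overhead, final merge

    System.out.printf("scan ~%.0f s, total ~%.1f min%n",
        scanSeconds, totalSeconds / 60);
  }
}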


On Tuesday, April 15, 2014, Shashidhar Rao <ra...@gmail.com>
wrote:

> Hi,
>
> Can somebody give me a rough estimate, in hours or minutes, of how long a
> cluster of, say, 30 nodes would take to run a MapReduce job that performs
> a word count on about 10 TB of data, assuming the hardware and the
> MapReduce program are tuned optimally?
>
> Just a rough estimate; the data could be 5 TB, 10 TB, or 20 TB. If not
> word count, the job could be any comparable analysis of data at that
> scale.
>
> Regards
> Shashidhar
>


-- 
Regards,
*Stanley Shi,*
