Posted to hdfs-user@hadoop.apache.org by "Kandoi, Nikhil" <Ni...@emc.com> on 2013/12/17 11:39:30 UTC

Estimating the time of my hadoop jobs

Hello everyone,

I am new to Hadoop and would like to see if I'm on the right track.
Currently I'm developing an application that will ingest logs on the order of 60-70 GB of data per day and then run
some analysis on them.
Now the infrastructure that I have is a 4-node cluster (all nodes on virtual machines), and every node has 4 GB of RAM.

But when I try to run a sample dataset of about 30 GB, it takes about 3 hours to process all of it.

I would like to know whether it is normal for this kind of infrastructure to take this amount of time.


Thank you

Nikhil Kandoi

RE: Estimating the time of my hadoop jobs

Posted by "Kandoi, Nikhil" <Ni...@emc.com>.

Thank you everyone for your suggestions,

I think I have an idea of where I was going wrong: not only was I setting up and destroying JVMs within a single Hadoop job,
I was also creating numerous Hadoop jobs to process different files that could be handled in one single job.

I will try the changes that I think should help solve the problem.

Regards,
Nikhil


From: Shekhar Sharma [mailto:shekhar2581@gmail.com]
Sent: Tuesday, December 17, 2013 9:12 PM
To: user@hadoop.apache.org
Subject: Re: Estimating the time of my hadoop jobs

Apart from what Devin has suggested, there are other factors worth noting when you are running your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in the cluster?

Since you have not mentioned it, I will assume the default of 2 map slots and 2 reduce slots per node, which on a 4-node cluster gives a total of 8 map slots and 8 reduce slots.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks can run in parallel on your cluster, and the other tasks have to wait.


(2) Since you have not mentioned whether the 30 GB of data is made up of lots of smaller files (less than the block size) or of bigger files, let us do a simple calculation assuming a single 30 GB file and a block size of 64 MB.

30 GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes

64 MB = 64 * 1024 * 1024 = 67108864 bytes


Total number of blocks the data will be broken into = 32212254720 / 67108864 = 480 blocks

This means you will be running 480 map tasks (assuming input split size = block size). But since you have only 8 map slots, only 8 map tasks will run at a time and the others will be pending.

Assuming all 8 map tasks in a wave finish at the same time, you will have 480 / 8 = 60 map waves.

(3) Each task runs in a separate JVM; that is, for every task a JVM is created, and after the task finishes the JVM is torn down. Creating and destroying JVMs is also a bottleneck.

So try reusing the same JVM. There is an option that lets you reuse the JVM.

(4) Since you are working with such big data, try using a combiner.

(5) Also try compressing the data and the intermediate output of the mappers and the reducer output
   ---First try with sequence files
   ---Then try the Snappy compression codec


If the pointers above bring the time down to around 1 hour,
then with the same 4-node cluster running on separate physical machines you should see the job complete in 15-30 minutes. [Please refer to Devin's comments.]



My suggestion is to get the best performance you can on your virtual machines and then move to a real Hadoop cluster. You should see a further performance improvement.



Regards,
Som Shekhar Sharma
+91-8197243810

On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <ds...@rdx.com> wrote:
Nikhil,

One of the problems you run into with Hadoop in Virtual Machine environments is performance issues when they are all running on the same physical host. With a VM, even though you are giving them 4 GB of RAM, and a virtual CPU and disk, if the virtual machines are sharing physical components like processor and physical storage medium, they compete for resources at the physical level. Even if you have the VM on a single host, or on a multi-core host with multiple disks and they are sharing as few resources as possible, there will still be a performance hit when the VM information has to pass through the hypervisor layer - co-scheduling resources with the host and things like that.

Does that make sense?

It's generally accepted that due to these issues, Hadoop in virtual environments does not offer the same performance benefits as a physical Hadoop cluster. It can run pretty well on even low-quality hardware though, so maybe you can acquire some used desktops, install your favorite Linux flavor on them, and make a cluster - some people have even run Hadoop on Raspberry Pi clusters.


Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Ni...@emc.com> wrote:
I know it is foolish of me to ask this, because there are a lot of factors that affect it,
but why is it taking so much time? Can anyone suggest possible reasons for it, or has anyone faced such an issue before?

Thanks,
Nikhil Kandoi
P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this version has something to do with it.

From: Azuryy Yu [mailto:azuryyyu@gmail.com]
Sent: Tuesday, December 17, 2013 4:14 PM
To: user@hadoop.apache.org
Subject: Re: Estimating the time of my hadoop jobs

Hi Kandoi,
It depends on:
how many cores each virtual node has
how complicated your analysis application is

But I don't think it's normal to spend 3 hours processing 30 GB of data, even on your *not good* hardware.






Re: Estimating the time of my hadoop jobs

Posted by Shekhar Sharma <sh...@gmail.com>.

Apart from what Devin has suggested, there are other factors worth noting
when you are running your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in the cluster?

Since you have not mentioned it, I will assume the default of 2 map slots
and 2 reduce slots per node, which on a 4-node cluster gives a total of
8 map slots and 8 reduce slots.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks can run
in parallel on your cluster, and the other tasks have to wait.


(2) Since you have not mentioned whether the 30 GB of data is made up of
lots of smaller files (less than the block size) or of bigger files, let us
do a simple calculation assuming a single 30 GB file and a block size of
64 MB.

30 GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes

64 MB = 64 * 1024 * 1024 = 67108864 bytes


Total number of blocks the data will be broken into = 32212254720 /
67108864 = 480 blocks

This means you will be running 480 map tasks (assuming input split size =
block size). But since you have only 8 map slots, only 8 map tasks will run
at a time and the others will be pending.

Assuming all 8 map tasks in a wave finish at the same time, you will have
480 / 8 = 60 map waves.
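
The block and wave arithmetic in point (2) can be checked with a few lines of Python. This is only a sketch under the same assumptions as the text (one 30 GB file, 64 MB blocks, input split size equal to block size, 8 map slots); the function name map_waves is made up for illustration and is not part of Hadoop.

```python
def map_waves(input_bytes, block_bytes, map_slots):
    """Return (map tasks, map waves), assuming one map task per block."""
    # Ceiling division: a partially filled last block still needs its own task.
    tasks = -(-input_bytes // block_bytes)
    # Each wave runs at most `map_slots` tasks at the same time.
    waves = -(-tasks // map_slots)
    return tasks, waves


GB = 1024 ** 3
MB = 1024 ** 2

tasks, waves = map_waves(30 * GB, 64 * MB, map_slots=8)
print(tasks, waves)  # 480 60
```

If the 30 GB is really many files smaller than a block, the task count is roughly the number of files instead, which can be far more than 480.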

(3) Each task runs in a separate JVM; that is, for every task a JVM is
created, and after the task finishes the JVM is torn down. Creating and
destroying JVMs is also a bottleneck.

So try reusing the same JVM. There is an option that lets you reuse the JVM.

(4) Since you are working with such big data, try using a combiner.

(5) Also try compressing the data and the intermediate output of the
mappers and the reducer output
   ---First try with sequence files
   ---Then try the Snappy compression codec
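
Points (3) and (5) correspond to job configuration properties. The sketch below only builds the -D options you could pass on the command line; it assumes the Hadoop 1.x property names (they were renamed in later releases), a job driver that goes through ToolRunner so that -D options are honoured, and a cluster where the native Snappy library is installed. The helper name generic_options is made up for illustration.

```python
# Hadoop 1.x property names; the values are starting points, not tuned settings.
TUNING = {
    # -1 lets one JVM be reused for any number of tasks of the same job.
    "mapred.job.reuse.jvm.num.tasks": "-1",
    # Compress the intermediate map output before it is shuffled to reducers.
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def generic_options(props):
    """Turn a property dict into a list of -Dname=value arguments."""
    return [f"-D{name}={value}" for name, value in props.items()]


print(" ".join(generic_options(TUNING)))
```

The same properties can also be set in mapred-site.xml or on the job configuration in the driver code; the command line is just the quickest way to try them on one job.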


If the pointers above bring the time down to around 1 hour,
then with the same 4-node cluster running on separate physical machines
you should see the job complete in 15-30 minutes. [Please refer to
Devin's comments.]



My suggestion is to get the best performance you can on your virtual
machines and then move to a real Hadoop cluster. You should see a further
performance improvement.



Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

> Nikhil,
>
> One of the problems you run into with Hadoop in Virtual Machine
> environments is performance issues when they are all running on the same
> physical host. With a VM, even though you are giving them 4 GB of RAM, and
> a virtual CPU and disk, if the virtual machines are sharing physical
> components like processor and physical storage medium, they compete for
> resources at the physical level. Even if you have the VM on a single host,
> or on a multi-core host with multiple disks and they are sharing as few
> resources as possible, there will still be a performance hit when the VM
> information has to pass through the hypervisor layer - co-scheduling
> resources with the host and things like that.
>
> Does that make sense?
>
> It's generally accepted that due to these issues, Hadoop in virtual
> environments does not offer the same performance benefits as a physical
> Hadoop cluster. It can be used pretty well with even low-quality hardware
> though, so maybe you can acquire some used desktops and install your
> favorite Linux flavor on them and make a cluster - some people have even
> run Hadoop on Raspberry Pi clusters.
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Ni...@emc.com> wrote:
>
>> I know it is foolish of me to ask this, because there are a lot of factors
>> that affect it,
>>
>> but why is it taking so much time? Can anyone suggest possible reasons
>> for it, or has anyone faced such an issue before?
>>
>>
>>
>> Thanks,
>>
>> Nikhil Kandoi
>>
>>  P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this
>> version has something to do with it.
>>
>>
>>
>> *From:* Azuryy Yu [mailto:azuryyyu@gmail.com]
>> *Sent:* Tuesday, December 17, 2013 4:14 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Estimating the time of my hadoop jobs
>>
>>
>>
>> Hi Kandoi,
>>
>> It depends on:
>>
>> how many cores each virtual node has
>>
>> how complicated your analysis application is
>>
>>
>>
>> But I don't think it's normal to spend 3 hours processing 30 GB of data,
>> even on your *not good* hardware.

Re: Estimating the time of my hadoop jobs

Posted by Shekhar Sharma <sh...@gmail.com>.

Apart from what Devin has suggested there are other factors which could be
worth while noting when you are running your hadoop cluster on virtual
machines.

(1) How many map and reduce slots are there in cluster?

 Since you have not mentioned and you are using 4 node hadoop cluster so
total of 8map slots and 8 reduce slots are present.
What does it mean?
It means that at a time on your cluster only 8 map tasks and 8 reduce task
will run parallely and other task have to wait..


(2) Since you have not mentioned anywhere that whether 30GB of data is made
up of lot of smaller files ( less than block size) or bigger file...let us
do a simple calculation assuming only one file of 30GB and assuming a block
size of 64MB

30GB = 30 * 1024 * 1024* 1024 = 32212254720

64MB = 64 * 1024*1024 =67108864


Total Number of blocks the data will be broken  = (32212254720) /
(67108864) = 480 Blocks

Now this means you will be running 480 Map tasks ( keeping in mind
inputsplit size = block size)...But since you have only 8 map slots so at a
time only 8 map task will run and others will be pending...

Assuming all the 8map tasks finishes at one time then you will have 480/8 =
60 map waves

 (3) Now you know that each task runs on a separate JVM, that means to say
for every task a jvm is created and then after the task is finished the JVM
is tear down..this is also a bottle neck, creation and destroy of JVM

So try reusing the same JVM. There is option where in you can reuse the JVM

(4) SInce you are working with such  big data, try using combiner?

(5) Also try compressing the data and the intermediate output of the
mappers and reducer op
   ---First try with sequence file
   ---Then try with snappy compression codec


By the above pointers if you can bring down the timings to atleast 1 hour
or so..
Then with the same 4 node cluster and Hadoop running on separate physical
machine you will for sure see the job getting completed in 15-30minutes..[
Please refer Devin's comments ]



My suggestion is get the optimal performance on your virtual machine and
then you go for real hadoop cluster. You will for sure see the performance
improvement



Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

> Nikhil,
>
> One of the problems you run into with Hadoop in Virtual Machine
> environments is performance issues when they are all running on the same
> physical host. With a VM, even though you are giving them 4 GB of RAM, and
> a virtual CPU and disk, if the virtual machines are sharing physical
> components like processor and physical storage medium, they compete for
> resources at the physical level. Even if you have the VM on a single host,
> or on a multi-core host with multiple disks and they are sharing as few
> resources as possible, there will still be a performance hit when the VM
> information has to pass through the hypervisor layer - co-scheduling
> resources with the host and things like that.
>
> Does that make sense?
>
> It's generally accepted that due to these issues, Hadoop in virtual
> environments does not offer the same performance benefits as a physical
> Hadoop cluster. It can be used pretty well with even low-quality hardware
> though, so maybe you can acquire some used desktops and install your
> favorite Linux flavor on them and make a cluster - some people have even
> run Hadoop on Raspberry Pi clusters.
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Ni...@emc.com>wrote:
>
>> I know it is foolish of me to ask this, because a lot of factors affect
>> it,
>>
>> but why is it taking so much time? Can anyone suggest possible reasons
>> for it, or has anyone faced such an issue before?
>>
>>
>>
>> Thanks,
>>
>> Nikhil Kandoi
>>
>>  P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this
>> version has something to do with it.
>>
>>
>>
>> *From:* Azuryy Yu [mailto:azuryyyu@gmail.com]
>> *Sent:* Tuesday, December 17, 2013 4:14 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Estimating the time of my hadoop jobs
>>
>>
>>
>> Hi Kandoi,
>>
>> It depends on:
>>
>> how many cores each VNode has
>>
>> how complicated your analysis application is
>>
>> But I don't think it's normal to spend 3 hours processing 30 GB of data,
>> even on your *not good* hardware.
>>
>> On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Ni...@emc.com>
>> wrote:
>>
>> Hello everyone,
>>
>>
>>
>> I am new to Hadoop and would like to see if I’m on the right track.
>>
>> Currently I’m developing an application which would ingest logs of order
>> of 60-70 GB of data/day and would then do
>>
>> Some analysis on them
>>
>> Now the infrastructure that I have is a 4 node cluster( all nodes on
>> Virtual Machines) , all nodes have 4GB ram.
>>
>>
>>
>> But when I try to run the dataset (which is a sample dataset at this
>> point ) of about 30 GB, it takes about 3 hrs to process all of it.
>>
>>
>>
>> I would like to know is it normal for this kind of infrastructure to take
>> this amount of time.
>>
>>
>>
>>
>>
>> Thank you
>>
>>
>>
>> Nikhil Kandoi/
>>
>>
>>
>
>

Re: Estimating the time of my hadoop jobs

Posted by Shekhar Sharma <sh...@gmail.com>.

Apart from what Devin has suggested, there are other factors worth noting
when you run your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in the cluster?

Since you have not mentioned it, I will assume the default of 2 map slots
and 2 reduce slots per node, so your 4-node cluster has a total of 8 map
slots and 8 reduce slots.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks can run in
parallel on your cluster; the other tasks have to wait.


(2) You have not mentioned whether the 30 GB of data is made up of many
small files (smaller than the block size) or a few large files. Let us do a
simple calculation assuming a single 30 GB file and a block size of 64 MB.

30 GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes

64 MB = 64 * 1024 * 1024 = 67108864 bytes


Total number of blocks the data will be split into
  = 32212254720 / 67108864 = 480 blocks

This means you will be running 480 map tasks (assuming input split size =
block size). But since you have only 8 map slots, only 8 map tasks will run
at a time and the others will be pending.

Assuming all 8 map tasks in a wave finish at the same time, you will have
480 / 8 = 60 map waves.
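
The arithmetic above can be sketched in a few lines of Python. This is only a
back-of-the-envelope estimate under the same assumptions (one 30 GB file,
64 MB blocks, split size equal to block size, 8 map slots). The function name
map_waves is made up for illustration, and the per-wave time at the end
assumes the map phase accounts for the whole 3 hours, which ignores the
shuffle and reduce phases.

```python
def map_waves(input_bytes, block_bytes, map_slots):
    """Return (number of map tasks, number of map waves)."""
    # One map task per input split; split size is assumed equal to block size.
    # -(-a // b) is integer ceiling division.
    tasks = -(-input_bytes // block_bytes)
    # Only `map_slots` tasks run at once, so the rest queue up in waves.
    waves = -(-tasks // map_slots)
    return tasks, waves

GB = 1024 ** 3
MB = 1024 ** 2

tasks, waves = map_waves(30 * GB, 64 * MB, map_slots=8)
print(tasks, waves)  # 480 60

# Upper bound on the time per wave if the whole 3 hours were map time.
minutes_per_wave = 3 * 60 / waves
print(minutes_per_wave)  # 3.0
```

Under these assumptions each wave handles 8 * 64 MB = 512 MB, so roughly
3 minutes per wave hints that per-task overhead, rather than raw data volume,
may be what dominates the run time.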

(3) Each task runs in a separate JVM by default: a JVM is created for every
task and torn down when the task finishes. Creating and destroying all these
JVMs is also a bottleneck.

So try reusing the same JVM. There is an option that lets you reuse the JVM
across tasks.
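
In Hadoop 1.x the option is the job property mapred.job.reuse.jvm.num.tasks
(default 1, meaning no reuse; -1 means no limit on the number of tasks of the
same job a JVM may run). One way to set it without changing the code is a -D
generic option on the command line, which only takes effect if the job driver
goes through ToolRunner / GenericOptionsParser. The Python sketch below just
assembles such a command; the jar name, driver class and paths are
placeholders, not taken from this thread.

```python
import shlex

def hadoop_jar_command(jar, main_class, args, properties):
    """Build a `hadoop jar` command line with -D generic options."""
    cmd = ["hadoop", "jar", jar, main_class]
    # Generic options must come before the job's own arguments.
    for key, value in sorted(properties.items()):
        cmd += ["-D", f"{key}={value}"]
    cmd += list(args)
    return cmd

cmd = hadoop_jar_command(
    jar="log-analysis.jar",          # placeholder jar name
    main_class="LogAnalysisDriver",  # placeholder driver class
    args=["/logs/in", "/logs/out"],  # placeholder HDFS paths
    properties={"mapred.job.reuse.jvm.num.tasks": -1},
)
print(" ".join(shlex.quote(part) for part in cmd))
```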

(4) Since you are working with such a large data set, try using a combiner.
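
To illustrate why a combiner helps, here is a small pure-Python simulation
(not Hadoop code) of a counting job, say lines per log level. It assumes the
reduce function is a sum, which is associative and commutative, so the same
function can safely run as a combiner on each mapper's output before the
shuffle. The sample records and function names are invented.

```python
from collections import Counter

def map_records(lines):
    """Map step: emit (log_level, 1) for every log line."""
    return [(line.split()[0], 1) for line in lines]

def combine(pairs):
    """Combiner: sum the counts per key locally, on the mapper side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

def reduce_all(pairs):
    """Reduce step: sum the counts per key across all mappers."""
    return dict(combine(pairs))

# Two input splits, one per (simulated) map task.
splits = [
    ["INFO start", "INFO read", "WARN slow", "INFO done"],
    ["ERROR fail", "INFO retry", "INFO done", "WARN slow"],
]

without = [pair for s in splits for pair in map_records(s)]
with_combiner = [pair for s in splits for pair in combine(map_records(s))]

# Records crossing the shuffle: 8 without a combiner, 5 with one.
print(len(without), len(with_combiner))  # 8 5
assert reduce_all(without) == reduce_all(with_combiner)
```

In a real job the combiner is registered on the job (setCombinerClass in the
Java API), and when the operation is a sum, count, min or max the reducer
class can often be reused as the combiner.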

(5) Also try compressing the data, both the intermediate output of the
mappers and the reducer output:
   --- First try a sequence file
   --- Then try the Snappy compression codec
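
A rough way to see how much shuffle and disk traffic compression could save
is to compress a sample of your own logs. The sketch below uses Python's zlib
purely as a stand-in, since Snappy is not in the Python standard library;
Snappy generally trades a lower compression ratio for higher speed, so treat
the number as indicative only. The log lines are synthetic. In Hadoop 1.x,
map output compression is controlled by mapred.compress.map.output and
mapred.map.output.compression.codec (for example
org.apache.hadoop.io.compress.SnappyCodec, provided your build and nodes
have the native Snappy library available).

```python
import zlib

def compression_ratio(data, level=1):
    """Return compressed size divided by original size for `data` (bytes)."""
    return len(zlib.compress(data, level)) / len(data)

# Synthetic, repetitive log lines; replace with a sample of real logs.
sample = "\n".join(
    f"2013-12-17 11:39:{i % 60:02d} INFO request id={i} status=200"
    for i in range(10_000)
).encode("utf-8")

ratio = compression_ratio(sample)
print(f"compressed to {ratio:.0%} of original size")
```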


If the pointers above bring the run time down to about 1 hour, then with the
same 4-node cluster running on separate physical machines you will surely
see the job complete in 15-30 minutes. [Please refer to Devin's comments.]



My suggestion is to get the best performance you can on your virtual
machines first and then move to a real Hadoop cluster. You will surely see a
performance improvement.



Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

> Nikhil,
>
> One of the problems you run into with Hadoop in Virtual Machine
> environments is performance issues when they are all running on the same
> physical host. With a VM, even though you are giving them 4 GB of RAM, and
> a virtual CPU and disk, if the virtual machines are sharing physical
> components like processor and physical storage medium, they compete for
> resources at the physical level. Even if you have the VM on a single host,
> or on a multi-core host with multiple disks and they are sharing as few
> resources as possible, there will still be a performance hit when the VM
> information has to pass through the hypervisor layer - co-scheduling
> resources with the host and things like that.
>
> Does that make sense?
>
> It's generally accepted that due to these issues, Hadoop in virtual
> environments does not offer the same performance benefits as a physical
> Hadoop cluster. It can be used pretty well with even low-quality hardware
> though, so maybe you can acquire some used desktops and install your
> favorite Linux flavor on them and make a cluster - some people have even
> run Hadoop on Raspberry Pi clusters.
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Ni...@emc.com>wrote:
>
>> I know it is foolish of me to ask this, because a lot of factors affect
>> it,
>>
>> but why is it taking so much time? Can anyone suggest possible reasons
>> for it, or has anyone faced such an issue before?
>>
>>
>>
>> Thanks,
>>
>> Nikhil Kandoi
>>
>>  P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this
>> version has something to do with it.
>>
>>
>>
>> *From:* Azuryy Yu [mailto:azuryyyu@gmail.com]
>> *Sent:* Tuesday, December 17, 2013 4:14 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Estimating the time of my hadoop jobs
>>
>>
>>
>> Hi Kandoi,
>>
>> It depends on:
>>
>> how many cores each VNode has
>>
>> how complicated your analysis application is
>>
>> But I don't think it's normal to spend 3 hours processing 30 GB of data,
>> even on your *not good* hardware.
>>
>> On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Ni...@emc.com>
>> wrote:
>>
>> Hello everyone,
>>
>>
>>
>> I am new to Hadoop and would like to see if I’m on the right track.
>>
>> Currently I’m developing an application which would ingest logs of order
>> of 60-70 GB of data/day and would then do
>>
>> Some analysis on them
>>
>> Now the infrastructure that I have is a 4 node cluster( all nodes on
>> Virtual Machines) , all nodes have 4GB ram.
>>
>>
>>
>> But when I try to run the dataset (which is a sample dataset at this
>> point ) of about 30 GB, it takes about 3 hrs to process all of it.
>>
>>
>>
>> I would like to know is it normal for this kind of infrastructure to take
>> this amount of time.
>>
>>
>>
>>
>>
>> Thank you
>>
>>
>>
>> Nikhil Kandoi/
>>
>>
>>
>
>

Re: Estimating the time of my hadoop jobs

Posted by Devin Suiter RDX <ds...@rdx.com>.

Nikhil,

One of the problems you run into with Hadoop in Virtual Machine
environments is performance issues when they are all running on the same
physical host. With a VM, even though you are giving them 4 GB of RAM, and
a virtual CPU and disk, if the virtual machines are sharing physical
components like processor and physical storage medium, they compete for
resources at the physical level. Even if you have the VM on a single host,
or on a multi-core host with multiple disks and they are sharing as few
resources as possible, there will still be a performance hit when the VM
information has to pass through the hypervisor layer - co-scheduling
resources with the host and things like that.

Does that make sense?

It's generally accepted that due to these issues, Hadoop in virtual
environments does not offer the same performance benefits as a physical
Hadoop cluster. It can be used pretty well with even low-quality hardware
though, so maybe you can acquire some used desktops and install your
favorite Linux flavor on them and make a cluster - some people have even
run Hadoop on Raspberry Pi clusters.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Ni...@emc.com>wrote:

> I know it is foolish of me to ask this, because a lot of factors affect
> it,
>
> but why is it taking so much time? Can anyone suggest possible reasons for
> it, or has anyone faced such an issue before?
>
>
>
> Thanks,
>
> Nikhil Kandoi
>
> P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this
> version has something to do with it.
>
>
>
> *From:* Azuryy Yu [mailto:azuryyyu@gmail.com]
> *Sent:* Tuesday, December 17, 2013 4:14 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Estimating the time of my hadoop jobs
>
>
>
> Hi Kandoi,
>
> It depends on:
>
> how many cores each VNode has
>
> how complicated your analysis application is
>
> But I don't think it's normal to spend 3 hours processing 30 GB of data,
> even on your *not good* hardware.
>
> On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Ni...@emc.com>
> wrote:
>
> Hello everyone,
>
>
>
> I am new to Hadoop and would like to see if I’m on the right track.
>
> Currently I’m developing an application which would ingest logs of order
> of 60-70 GB of data/day and would then do
>
> Some analysis on them
>
> Now the infrastructure that I have is a 4 node cluster( all nodes on
> Virtual Machines) , all nodes have 4GB ram.
>
>
>
> But when I try to run the dataset (which is a sample dataset at this point
> ) of about 30 GB, it takes about 3 hrs to process all of it.
>
>
>
> I would like to know is it normal for this kind of infrastructure to take
> this amount of time.
>
>
>
>
>
> Thank you
>
>
>
> Nikhil Kandoi/
>
>
>

RE: Estimating the time of my hadoop jobs

Posted by "Kandoi, Nikhil" <Ni...@emc.com>.

I know it is foolish of me to ask this, because a lot of factors affect it,
but why is it taking so much time? Can anyone suggest possible reasons for it, or has anyone faced such an issue before?

Thanks,
Nikhil Kandoi
P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this version has something to do with it.

From: Azuryy Yu [mailto:azuryyyu@gmail.com]
Sent: Tuesday, December 17, 2013 4:14 PM
To: user@hadoop.apache.org
Subject: Re: Estimating the time of my hadoop jobs

Hi Kandoi,
It depends on:
how many cores each VNode has
how complicated your analysis application is

But I don't think it's normal to spend 3 hours processing 30 GB of data, even on your *not good* hardware.

On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Ni...@emc.com> wrote:
Hello everyone,

I am new to Hadoop and would like to see if I'm on the right track.
Currently I'm developing an application which would ingest logs of order of 60-70 GB of data/day and would then do
Some analysis on them
Now the infrastructure that I have is a 4 node cluster( all nodes on Virtual Machines) , all nodes have 4GB ram.

But when I try to run the dataset (which is a sample dataset at this point ) of about 30 GB, it takes about 3 hrs to process all of it.

I would like to know is it normal for this kind of infrastructure to take this amount of time.

Thank you

Nikhil Kandoi/

Re: Estimating the time of my hadoop jobs

Posted by Azuryy Yu <az...@gmail.com>.

Hi Kandoi,
It depends on:
how many cores each VNode has
how complicated your analysis application is

But I don't think it's normal to spend 3 hours processing 30 GB of data,
even on your *not good* hardware.

On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Ni...@emc.com> wrote:

> Hello everyone,
>
> I am new to Hadoop and would like to see if I’m on the right track.
> Currently I’m developing an application which would ingest logs of order
> of 60-70 GB of data/day and would then do some analysis on them.
> Now the infrastructure that I have is a 4 node cluster (all nodes on
> Virtual Machines), all nodes have 4GB ram.
>
> But when I try to run the dataset (which is a sample dataset at this point)
> of about 30 GB, it takes about 3 hrs to process all of it.
>
> I would like to know is it normal for this kind of infrastructure to take
> this amount of time.
>
> Thank you
>
> Nikhil Kandoi/
