Posted to mapreduce-user@hadoop.apache.org by ma...@nissatech.com on 2015/05/12 21:57:36 UTC

Smaller block size for more intense jobs

Hello,

I'm wondering whether I should set the block size to something smaller than
64MB, given that my mappers need to do intensive computations.

I know that it is generally better to have larger files, because of
replication overhead and the NameNode being a weak point, but I don't have
that much data, while the operations that need to be performed on it are
intensive.

It looks like it would be better to have a smaller block size (at least
until there is more data) so that multiple Mappers get instantiated and can
share the computation.

I'm currently talking about Hadoop 1, not YARN, but a heads-up about the
same problem under YARN would be appreciated.

Thanks,
Marko

Sent with [inky](http://inky.com?kme=signature)
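
A minimal sketch of the kind of job setup being discussed: rather than
re-writing the data with a smaller HDFS block size, the maximum input split
size can be lowered per job so that more mappers are instantiated for the
same file. This assumes the Hadoop 1.x "new" org.apache.hadoop.mapreduce
API; SignalMapper, the paths and the 8 MB cap are illustrative placeholders,
not anything taken from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitJob {

  // Hypothetical stand-in for a compute-heavy mapper.
  public static class SignalMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... expensive per-record similarity computation would go here ...
      context.write(new Text(value.toString()), new DoubleWritable(0.0));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "cpu-heavy-small-splits"); // Hadoop 1.x constructor
    job.setJarByClass(SmallSplitJob.class);
    job.setMapperClass(SignalMapper.class);
    job.setNumReduceTasks(0);                          // map-only for this sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Cap each input split at 8 MB: a 35 MB input then yields roughly 4-5
    // map tasks even though the HDFS block size stays at the default 64 MB.
    FileInputFormat.setMaxInputSplitSize(job, 8L * 1024 * 1024);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}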

Re: Smaller block size for more intense jobs

Posted by Marko Dinic <ma...@nissatech.com>.
Dear Harshit,

Thank you very much for your answer.

To be fully honest with you, I'm currently given just a small sample of
data (a mere 35MB), so that I can develop the processing; hopefully there
will be a lot more later, though I don't know how much.

I'm not really having a problem with a large number of small files (not at
the moment); I'm expecting to have normal-sized files, or even to read the
data from a database such as Cassandra (I'm not sure yet whether and how
that is going to work).

What I do know is that there is some heavy processing of the data, since
we're talking about machine learning (data mining) algorithms such as
clustering, which may involve a number of steps and iterations. In one of
the mappers, for example, I'm calculating similarities between signals,
which can be intensive.

So, to wrap it up, I'm not sure how big the data will be, but I know that
the processing is intensive and, in my opinion, currently a bit slow. The
complete algorithm consists of a number of steps and iterations, as I said,
and it takes a couple of hours for the 35MB dataset, which worries me.

What do you, or anyone else willing to join the discussion, think?

Best regards,
Marko

On Wed 13 May 2015 06:17:58 AM CEST, Harshit Mathur wrote:
> Hi Marko,
>
> If your files are very small (smaller than the block size), then a lot of
> map tasks will get executed, but the per-task initialization and other
> overheads degrade the overall performance. It might appear that each
> single map is executing very fast, yet the overall job execution will take
> more time.
>
> I had a similar problem where the data files were huge in number but each
> individual file was much smaller than the block size, and because of this
> the framework launched a large number of maps. This was taking a great
> amount of time in the overall job execution, so to overcome the issue we
> used CombineFileInputFormat. It builds the input splits more efficiently
> so that an optimal number of maps is executed, and the overall job
> execution time improves drastically.
>
> Can you give some info about the size of the data and the processing
> logic in your map function? It will help me understand your issue better.
>
> BR,
> Harshit Mathur
>
> On Wed, May 13, 2015 at 1:27 AM, <marko.dinic@nissatech.com> wrote:
>
>     Hello,
>
>     I'm wondering whether I should set the block size to something
>     smaller than 64MB, given that my mappers need to do intensive
>     computations.
>
>     I know that it is generally better to have larger files, because of
>     replication overhead and the NameNode being a weak point, but I don't
>     have that much data, while the operations that need to be performed on
>     it are intensive.
>
>     It looks like it would be better to have a smaller block size (at
>     least until there is more data) so that multiple Mappers get
>     instantiated and can share the computation.
>
>     I'm currently talking about Hadoop 1, not YARN, but a heads-up about
>     the same problem under YARN would be appreciated.
>
>     Thanks,
>
>     Marko
>
>     Sent with inky <http://inky.com?kme=signature>
>
>
>
>
> --
> Harshit Mathur

Re: Smaller block size for more intense jobs

Posted by Harshit Mathur <ma...@gmail.com>.
Hi Marko,

If your files are very small (smaller than the block size), then a lot of
map tasks will get executed, but the per-task initialization and other
overheads degrade the overall performance. It might appear that each single
map is executing very fast, yet the overall job execution will take more
time.

I had a similar problem where the data files were huge in number but each
individual file was much smaller than the block size, and because of this
the framework launched a large number of maps. This was taking a great
amount of time in the overall job execution, so to overcome the issue we
used CombineFileInputFormat. It builds the input splits more efficiently so
that an optimal number of maps is executed, and the overall job execution
time improves drastically.
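
For reference, a minimal sketch of that approach, assuming a Hadoop release
that ships CombineTextInputFormat in org.apache.hadoop.mapreduce.lib.input
(newer releases do; on plain Hadoop 1 you may instead have to subclass
CombineFileInputFormat and supply your own record reader). The paths and
the 128 MB cap are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);

    // Pack many small files into shared splits of at most 128 MB, so one
    // map task reads several files instead of one map task per tiny file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // Identity mapper/reducer defaults are used here; a real job would
    // plug in its own classes.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}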

Can you give some info about the size of the data and the processing logic
in your map function? It will help me understand your issue better.

BR,
Harshit Mathur

On Wed, May 13, 2015 at 1:27 AM, <ma...@nissatech.com> wrote:

>  Hello,
>
>
>
> I'm wondering whether I should set the block size to something smaller
> than 64MB, given that my mappers need to do intensive computations.
>
>
>
> I know that it is generally better to have larger files, because of
> replication overhead and the NameNode being a weak point, but I don't have
> that much data, while the operations that need to be performed on it are
> intensive.
>
>
>
> It looks like it would be better to have a smaller block size (at least
> until there is more data) so that multiple Mappers get instantiated and
> can share the computation.
>
>
>
> I'm currently talking about Hadoop 1, not YARN, but a heads-up about the
> same problem under YARN would be appreciated.
>
>
>
> Thanks,
>
> Marko
>
>
>   Sent with inky <http://inky.com?kme=signature>
>
>


-- 
Harshit Mathur
